Strategies for Collecting and Preserving Open Access Materials on the Web
William Y. Arms
Cornell University
Federal Library and Information Center Committee
Open Access Materials on the Web
The Library of Congress:the Web Preservation Project
The Library of Congress collects the cultural and intellectual output of today for the benefit of future generations.
An ever-increasing amount of this material is born digital.
The library has:
• privileged legal position
• generous public funding
... but cannot do everything!
Step 1: Open Access Materials on the Web
Approaches to Preservation of the Web
(Diagram: a spectrum from closed access to open access)
• Closed access: partnership with publishers -- publishers and libraries as partners
• Open access, selective collection of the web -- librarianship in a new domain
• Open access, bulk collection of the web -- automated processes
Example: Web Preservation Project Pilot
• Small number of web sites nominated by selection officers. Three chosen for close study.
http://www.whitehouse.gov/
http://www.algore2000.com/
http://www.georgewbush.com/
• Copies downloaded using HTTrack mirroring program. Inspected for errors, anomalies, etc.
• Catalog records created using OCLC's CORC software. Loaded into the Library of Congress's ILS system.
• Trial web site developed to evaluate user interfaces.
Example: The Internet Archive
Example: National Library of Australia
Example: National Library of Sweden
Selection and Collection
Collecting: Making a Snapshot
(Diagram: Web site => Download => Snapshot => Archive)
A web site is downloaded using a mirroring program. The snapshot is stored in an archive.
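The snapshot step above can be sketched in code. This is a minimal sketch that assumes the pages have already been fetched (a real harvest would use a mirroring program such as HTTrack); the function names and path-mapping scheme are illustrative, not taken from any of the projects discussed here.

```python
# A minimal sketch of storing a snapshot of a web site as ordinary files.
# Each URL is mapped to a local file path, the way mirroring programs
# rewrite a site for local storage.
import os
from urllib.parse import urlparse

def url_to_local_path(url: str) -> str:
    """Map a URL to a relative file path inside a snapshot directory."""
    parts = urlparse(url)
    path = parts.path.lstrip("/") or "index.html"
    if path.endswith("/"):
        path += "index.html"
    return os.path.join(parts.netloc, path)

def save_snapshot(pages: dict, snapshot_root: str) -> list:
    """Write fetched pages (url -> bytes) under snapshot_root; return paths."""
    written = []
    for url, content in pages.items():
        dest = os.path.join(snapshot_root, url_to_local_path(url))
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        with open(dest, "wb") as f:
            f.write(content)
        written.append(dest)
    return written
```

A real mirroring program also rewrites the links inside each HTML page to point at the local copies; that step is omitted here.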
Collecting: Periodic Snapshots
(Diagram: Web site => Snapshot 1, Snapshot 2, Snapshot 3 => Archive)
At scheduled time intervals, additional snapshots are made.
Selection Decisions
Which sites to collect
• Bulk -- collect all sites within a certain category
• Selective -- collect sites selected by a librarian
How often to make snapshots
• Monthly, weekly, or depending on circumstances
Which content to collect
• HTML pages only
• Text and images only
• Everything
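The "which content to collect" decision above can be expressed as a simple filter applied to each resource the crawler encounters. A sketch, assuming the policy is one of three levels keyed by MIME type; the policy names and MIME sets are illustrative:

```python
# A sketch of a content-selection filter. The three policies correspond to
# the options above: HTML pages only, text and images only, or everything.
HTML_ONLY = {"text/html"}
TEXT_AND_IMAGES = HTML_ONLY | {"text/plain", "image/gif", "image/jpeg", "image/png"}

def should_collect(mime_type: str, policy: str) -> bool:
    """Decide whether a resource of the given MIME type is collected."""
    if policy == "everything":
        return True
    if policy == "text+images":
        return mime_type in TEXT_AND_IMAGES
    if policy == "html":
        return mime_type in HTML_ONLY
    raise ValueError(f"unknown policy: {policy}")
```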
Examples of Selection Decisions
                   Selection   Frequency   Content
Internet Archive   bulk        monthly     HTML + images
Pandora            selective   varies      all
Kulturarw3         bulk        sweeps      all
Web Preservation   selective   irregular   all
Legal Issues
The legal position of archives that download open access materials is unclear
• Preservation is in the national interest
• See the discussion in The Digital Dilemma
• Crucial factor is economic impact on copyright owners
• Library of Congress has no special position except via copyright deposit
Legal Issues: Thoughts and Actions
• The presumption is that downloading open access materials is permitted by the publisher ...
... unless another indication is given, e.g., robot exclusion using a robots.txt file
• Different parties to consider
=> Library of Congress
=> other national libraries
=> partners of the Library of Congress and national libraries
=> independent archives
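The robots.txt exclusion mentioned above can be checked with Python's standard-library parser. In this sketch the robots.txt rules are supplied inline; a harvester would normally fetch the file from the site itself, and the site and user-agent names here are made up:

```python
# Checking robot exclusion rules with the standard library.
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("archive-crawler", "http://www.example.gov/index.html"))  # True
print(rp.can_fetch("archive-crawler", "http://www.example.gov/private/x"))   # False
```

An archive that honors these rules simply skips any URL for which can_fetch returns False.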
The U.S. Copyright Office has offered to help clarify the situation
Access to Collections
Access: Analysis by Computer
(Diagram: the archive holds Snapshot 1, Snapshot 2, Snapshot 3; analysis by computer is performed directly on the snapshot files.)
Access: Analysis by Patron
(Diagram: each snapshot in the archive has a corresponding access version: Snapshot 1 => Access 1, Snapshot 2 => Access 2, Snapshot 3 => Access 3. The patron analyzes the access versions; analysis by computer uses the snapshots directly.)
Access Decisions
Style of access
• Analysis of snapshot files by computer
• Analysis of Web access version by patron
Editing
• Minimal editing to make access version
• Fuller editing to maintain experience
• Automatic or by hand
Policy
• Who has access to the collections?
Examples of Access Decisions
                   Style        Editing
Internet Archive   computer     none
Pandora            researcher   some
Kulturarw3         ?            ?
Web Preservation   researcher   some
Information Discovery
Options for Information Discovery
Very large numbers of Web sites will be collected and preserved. Some form of index or catalog is required.
Options
• List of sites (e.g., Internet Archive)
=> Access by URL + date
• Automatic index (e.g., Web search engines)
• Catalog (e.g., Web Preservation Project)
=> Record for individual site or group of sites
=> Access through library catalog
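Access by URL plus date, as in a list of sites, amounts to looking up the latest snapshot taken on or before the requested date. A sketch, assuming a simple index from URL to capture dates (real archives use far more elaborate index structures):

```python
# A sketch of "access by URL + date": given the capture dates held for a
# URL, return the latest snapshot taken on or before the requested date.
from datetime import date

def find_snapshot(index: dict, url: str, wanted: date):
    """Return the latest capture date <= wanted for url, or None."""
    best = None
    for d in sorted(index.get(url, [])):
        if d <= wanted:
            best = d
        else:
            break
    return best
```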
Information Discovery: Web Preservation Project
Procedure
• MARC catalog records created using OCLC's CORC system.
• Loaded into the Library of Congress's ILS.
Observations
• Catalog effort similar to other electronic files
• Continual changes between snapshots
• Some similarities to serials
• No significant workflow difficulties
Storage
Storage: Preservation Versions
(Diagram: repeated copies of Snapshot 1 and its access version, Access 1.)
Over time, other versions of a snapshot will be made for preservation.
Storage Decisions: Size
Each Web site will be stored many times
• Repeated snapshots
• Access versions
• Preservation versions
Saving space
• Many files are repeated (e.g., video clips)
• Storing a single copy saves space, but leads to more complex computer systems
• Compressing files saves space, but leads to more complex computer systems
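Storing a single copy of repeated files can be sketched as a content-addressed store: files are keyed by a cryptographic digest of their bytes, so identical content is stored once. The class and field names here are illustrative:

```python
# A sketch of single-copy storage for repeated files. Content is keyed by
# its SHA-256 digest, so a file that appears in many snapshots (e.g., a
# video clip) is stored once and referenced by digest thereafter.
import hashlib

class DedupStore:
    def __init__(self):
        self.blobs = {}   # digest -> content, one copy per distinct file
        self.refs = {}    # (snapshot, path) -> digest

    def put(self, snapshot: str, path: str, content: bytes) -> str:
        digest = hashlib.sha256(content).hexdigest()
        self.blobs.setdefault(digest, content)  # keep only the first copy
        self.refs[(snapshot, path)] = digest
        return digest

    def get(self, snapshot: str, path: str) -> bytes:
        return self.blobs[self.refs[(snapshot, path)]]
```

This is the trade-off named above in miniature: the store saves space, but every retrieval now goes through an extra level of indirection.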
Very Rough Estimates of Size and Cost
Public web sites (OCLC, February 2000)               2,900,000
Library of Congress collects 1%                         30,000
Average size of site                                 60 Mbytes
Size of 30,000 sites                             1.8 terabytes
Storage requirements/year (monthly snapshots)   21.6 terabytes
Storage requirements (no duplicates)             5.0 terabytes
Cost per year ($25,000 per terabyte)                  $125,000
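The estimates above can be reproduced as arithmetic; all figures are the talk's own rough numbers from February 2000.

```python
# The rough size and cost estimates, as arithmetic.
public_sites = 2_900_000                              # OCLC count, Feb 2000
collected = round(public_sites * 0.01, -4)            # 1%, about 30,000 sites
avg_site_mb = 60
collection_tb = collected * avg_site_mb / 1_000_000   # 1.8 terabytes
per_year_tb = collection_tb * 12                      # 21.6 TB with monthly snapshots
dedup_tb = 5.0                                        # talk's estimate with duplicates removed
cost_per_year = dedup_tb * 25_000                     # $125,000 at $25,000 per terabyte
```

The gap between 21.6 TB of raw snapshots and 5.0 TB after duplicate removal is why the single-copy storage strategy on the previous slide matters.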
Storage Decisions: Identification
Identification of Web site
• URL, but Web sites may change their URL
• URN (e.g., Handle or PURL)
Identification and provenance of versions
• Web site identifier
• Collection information (date, time, etc.)
• History of changes
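The identification and provenance items above can be sketched as one record per version; the field names here are illustrative, not a schema used by any of the projects discussed.

```python
# A sketch of an identification-and-provenance record for one snapshot.
from dataclasses import dataclass, field

@dataclass
class SnapshotRecord:
    site_id: str     # persistent identifier, e.g., a Handle or PURL
    url: str         # URL at time of capture; may change over a site's life
    captured: str    # date and time of collection, ISO 8601
    history: list = field(default_factory=list)  # notes on later changes

    def record_change(self, note: str) -> None:
        """Append one entry to the history of changes for this version."""
        self.history.append(note)
```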
Workflow
(Diagram: a Web crawler takes a snapshot of the web site; an accession control process loads it into the archive; the catalog and external access services support analysis by patron and analysis by computer.)
Preservation
Objective
The objective is to preserve the digital collections in a manner that makes them usable for scholarship and research in the future.
What is preserved?
• Preservation of bits
• Preservation of content
• Preservation of experience
How is it used?
• Analysis by computer program
• Analysis by human researcher
• Viewed by human researcher
Process of Preservation
(Diagram: Version 1 at Time 0 => Version 2 at Time 1 => Version 3 at Time 2.)
This process may be applied to either the snapshot or the access version.
Preservation: Refreshing
Each version is created from the previous by exactly copying the bits.
• Keeps the exact files for all time
• Preserves bits and content, but not always in an accessible form
• Later computers and software are unlikely to support today's protocols, formats, languages, etc.
Keeping the unedited snapshot files by repeated refreshing should be a basic part of any preservation strategy.
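Refreshing means copying the bits exactly, and the copy should be verified. A sketch, pairing the copy with a fixity check via a cryptographic digest (the function name is illustrative; production systems also record the digest for later audits):

```python
# A sketch of refreshing with a fixity check: the new copy is verified
# bit-for-bit against the original by comparing SHA-256 digests.
import hashlib
import shutil

def refresh(src: str, dst: str) -> str:
    """Copy src to dst, verify the bits are identical, return the digest."""
    shutil.copyfile(src, dst)
    with open(src, "rb") as f:
        want = hashlib.sha256(f.read()).hexdigest()
    with open(dst, "rb") as f:
        got = hashlib.sha256(f.read()).hexdigest()
    if want != got:
        raise IOError(f"refresh of {src} corrupted the bits")
    return got
```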
Preservation: Automatic Migration of Individual Files
As protocols, formats, languages, etc. become obsolete, convert individual files to new standards.
• Can be carried out automatically
• Preserves content and helps toward preservation of experience
• Effectiveness depends on availability of conversion tools and the complexity and quality of original source
• Migrated versions will steadily diverge from original
• Web sites will eventually cease to function
Automated migration of individual files is the basic technique for keeping web sites functional at moderate cost.
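The file-by-file migration described above can be sketched as a registry of conversion tools keyed by format. Everything here is illustrative: the GIF-to-PNG converter is a placeholder stub, and real effectiveness depends, as the slide says, on the conversion tools actually available for each format.

```python
# A sketch of automatic migration of individual files: each obsolete
# format has a registered converter, and files are converted as they
# are encountered. Files with no registered converter are left alone.
CONVERTERS = {}

def converter(old_fmt: str, new_fmt: str):
    """Register a conversion function for one obsolete format."""
    def register(fn):
        CONVERTERS[old_fmt] = (new_fmt, fn)
        return fn
    return register

@converter("gif", "png")
def gif_to_png(data: bytes) -> bytes:
    return b"PNG:" + data          # placeholder for a real conversion tool

def migrate(name: str, data: bytes):
    """Convert one file to a current format if its format is registered."""
    fmt = name.rsplit(".", 1)[-1]
    if fmt not in CONVERTERS:
        return name, data          # nothing to do; keep the original bits
    new_fmt, fn = CONVERTERS[fmt]
    return name[: -len(fmt)] + new_fmt, fn(data)
```

Because each migration rewrites the file, the migrated versions diverge from the original over time, which is why the unedited snapshots must also be kept by refreshing.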
Preservation: Automatic Migration with Manual Editing
In conjunction with automatic migration, web sites are reviewed by a librarian and edited as necessary to preserve functionality.
• The only method that can be expected to preserve the experience of using web sites
• Migrated versions will steadily diverge from original
• Some web sites will be impossible to edit without changing the experience
Manual editing is very expensive and is therefore suitable for only a small number of particularly important sites.
Acknowledgements
The members of the Web Preservation Project are:
Roger Adkin
Cassy Ammen
William Arms
Allene Hayes
Melissa Levine
Diane Kresh
Barbara Tillett