digital preservation and the open web: a curatorial perspective

Digital Preservation and the Open Web:

A Curatorial Perspective

Terence K. Huwe
Institute of Industrial Relations

University of California, Berkeley

Computers in Libraries, March 2006

Overview

A Brief Description of “The Web at Risk” Project
– How it’s organized, who’s involved

Objectives of the Project
– Preservation of the open Web
– Development of an open source “Tool Kit”

How it works, where it’s going, from a “special collections” perspective

The Web at Risk Project

3 year, 2.4 million dollar grant from the Library of Congress National Digital Information Infrastructure and Preservation Program (NDIIPP)

Coordinating Agency: The California Digital Library

Primary focus on developing open access archiving tools that can be applied to any discipline with Web content worth keeping
– Extensible, modular, easily configured to work with existing technologies that are already in place

Project Stages

Content Identification and Selection
– Key issues for analysis, framework for sample crawls, working with collection partners, exploring extensibility

Content Acquisition
– Content Harvest and Acquisition, configuring of Web Crawler, Analyzer, Content User Interface (CUI), Export/Import Handler

Content Retention
– Data model for Web Archive Digital Objects (WADO) testing and modification, assessing the CDL Digital Preservation Repository for ingest and retention

Partnership Building
– Model Agreements for content retention, evaluate future steps, assess costs of sustaining a distributed approach to Web archiving

Partners in this NDIIPP Grant

Main Partners:
– New York University
– University of North Texas, The Libraries
– Texas Center for Digital Knowledge

Technical Partners:
– UC San Diego Supercomputer Center
– Stanford University Computer Science Department
– Sun Microsystems, Inc.

National Curatorial Partners

Arizona State University Library and Archive

New York University Tamiment Library

University of North Texas, The Libraries

Stanford University Library’s Social Sciences Research Center

University of California Curatorial Partners

UCLA Online Campaign Literature Archive

UC Berkeley Institute of Governmental Studies Library

UC Berkeley Institute of Industrial Relations Library

Eight UC Libraries in the Federal Depository Library Program:
– Berkeley, Davis, Irvine, UCLA, Riverside, San Diego, Santa Barbara, Santa Cruz

The Institute of Industrial Relations:
Capturing Labor History in Action

News, data and links are being generated by unions at both the international and local levels

Union priorities are necessarily “just in time,” and they operate in a state of high triage

Preserving these data is a high priority for IIR and the NYU Tamiment Library

It’s not likely that a non-academic host will do so, making the challenge more urgent

Where Things Stand Now

We’ve got a Wiki and curators are in touch

IIR and NYU/Tamiment are coordinating on labor issues

Technical issues have moved to the fore
– Figuring out the configuration of the crawler, what to crawl

The first crawl report has come back

The results are provocative and interesting

First Crawl Highlights

30 sites crawled, maximum set to 1 gigabyte
– 18 hit the 1 gigabyte limit

Average files on host: 6,359

Average with linked hosts included: 17,247

Most files on a single server: 46,197

Median duration of crawl (host): 7 hr 33 min

The crawler, Heritrix 1.5.1, returned different data than other crawlers (HTTrack, Wget)

Rights and Permissions Vary According to Host

A three-level scheme for future rights management:

Consent Implied: Crawl without permission
– 14 sites in this category

Consent Sought: Crawl but also identify and notify the data owner
– 13 sites in this category

Consent Required: Advance permission needed
– 3 sites in this category
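
The three-level scheme above can be sketched in code. This is a minimal illustration, not the project's own tooling: the names `Consent`, `may_crawl`, `must_notify` and `shares` are hypothetical, and only the three levels and the site counts come from the slide.

```python
from collections import Counter
from enum import Enum


class Consent(Enum):
    """The three consent levels named on the slide."""
    IMPLIED = "crawl without permission"
    SOUGHT = "crawl, but also identify and notify the data owner"
    REQUIRED = "advance permission needed"


def may_crawl(level, permission_granted=False):
    """Return True if a crawl may start under the given consent level."""
    if level is Consent.REQUIRED:
        # Only this level blocks the crawl until permission is in hand.
        return permission_granted
    return True


def must_notify(level):
    """Return True if the data owner has to be identified and notified."""
    return level is Consent.SOUGHT


# Site counts reported for the first crawl (30 sites in total).
FIRST_CRAWL = Counter({
    Consent.IMPLIED: 14,
    Consent.SOUGHT: 13,
    Consent.REQUIRED: 3,
})


def shares(counts):
    """Percentage of sites in each category, rounded to one decimal."""
    total = sum(counts.values())
    return {level: round(100 * n / total, 1) for level, n in counts.items()}


if __name__ == "__main__":
    for level, pct in shares(FIRST_CRAWL).items():
        print(f"{level.name:<9} {FIRST_CRAWL[level]:>2} sites  {pct}%")
```

Modeling the levels as an enumeration keeps the per-site rights decision explicit, so a curator's choice can be recorded next to each seed URL.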

Web aRchive Access (WERA)

An open source tool for viewing crawl results

Very new, very much still in development

Relies upon a search query to display the crawled resources

Does not really present how an average user would utilize a finished collection

The Fine Print Matters

Heritrix 1.5.1 doesn’t capture the directory tree of servers; it follows links

Many domains involve multiple servers, and crucial files (such as CSS libraries) need to be captured

The value of capturing linked files varies from site to site, from irrelevant to vitally important
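
The difference between following links and capturing a directory tree can be shown with a toy example. This is a sketch under stated assumptions, not how Heritrix is implemented: the `SITE` map, its file names and the `crawl_by_links` function are hypothetical, and the `max_files` cap is a simplified stand-in for the 1 gigabyte size limit used in the first crawl.

```python
from collections import deque

# Hypothetical site: every file that exists on the server, mapped to the
# files it links to. "/old/report-1999.pdf" exists, but nothing links to it.
SITE = {
    "/index.html": ["/news.html", "/style.css"],
    "/news.html": ["/index.html", "/press/2006.html"],
    "/press/2006.html": [],
    "/style.css": [],
    "/old/report-1999.pdf": [],
}


def crawl_by_links(site, seed, max_files=None):
    """Breadth-first crawl that only discovers files reachable via links."""
    seen = {seed}
    queue = deque([seed])
    captured = []
    while queue:
        if max_files is not None and len(captured) >= max_files:
            break  # cap reached: anything still queued is never captured
        path = queue.popleft()
        captured.append(path)
        for target in site.get(path, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return captured


if __name__ == "__main__":
    captured = crawl_by_links(SITE, "/index.html")
    print("captured:", captured)
    print("on server but never captured:", sorted(set(SITE) - set(captured)))
    print("with a cap of 2 files:", crawl_by_links(SITE, "/index.html", max_files=2))
```

A file that no page links to stays invisible to this kind of crawl even though it sits on the server, and a cap cuts the crawl off before everything queued has been captured.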

Curator Perspectives

Need to capture “new publications” as they appear

By a slight majority, monthly intervals are favored for crawl frequency

How much multimedia should be captured? The 1 gigabyte limit obscured the answer

About 70 percent of curators rated the crawl as “mostly effective”

Curators approached the process collaboratively from the very beginning, communicating proactively. This implies that collaborative collection development is viable

What’s Needed

Curators want to see some sort of user interface to evaluate the experience of viewing archived Web resources

The relationship between a particular host and whatever it links to is stimulating debate; probably both are needed

Long term sustainability of this project will depend on attracting interest from government and industry

Looking Ahead

The Open Access toolkit will be rigorously tested (and will not appear for at least 2 years)

This approach places most responsibility with curators, just as special collection development activity would mandate

This is a new stream of work for information professionals, but the standardization of the toolkit could be an important innovation

Conclusions

The profession-wide culture of collaborative collection development is alive and well, and digesting new digital collection strategies

The combination of a toolkit “deliverable” and the pooled experience of the cohort will be enormously useful for all digital librarians

Hands-on collection experts are in an excellent position to advise technologists in the creation of new digital archiving tools at the ground level

URLs Referenced

The Web at Risk: http://www.cdlib.org/inside/projects/preservation/webatrisk/

Heritrix Web Site: http://crawler.archive.org/

Web aRchive Access: http://nea.nb.no/

UCLA Campaign Literature Archive: http://digital.library.ucla.edu/campaign

The AFL-CIO: http://www.aflcio.org

Service Employees International Union: http://www.seiu.org

Change to Win: http://www.changetowin.org

The Institute of Industrial Relations Library: http://www.iir.berkeley.edu/library

