Digital Preservation and the Open Web: A Curatorial Perspective

Terence K. Huwe, Institute of Industrial Relations, University of California, Berkeley
Computers In Libraries, March 2006


Page 1: Digital Preservation and the Open Web: A Curatorial Perspective

Digital Preservation and the Open Web:
A Curatorial Perspective

Terence K. Huwe
Institute of Industrial Relations
University of California, Berkeley

Computers In Libraries
March 2006

Page 2: Digital Preservation and the Open Web: A Curatorial Perspective

Overview

A Brief Description of “The Web at Risk” Project
– How it’s organized, who’s involved

Objectives of the Project
– Preservation of the open Web
– Development of an open source “Tool Kit”

How it works, where it’s going, from a “special collections” perspective

Page 3: Digital Preservation and the Open Web: A Curatorial Perspective

The Web at Risk Project

Three-year, 2.4 million dollar grant from the Library of Congress National Digital Information Infrastructure and Preservation Program (NDIIPP)

Coordinating Agency: The California Digital Library

Primary focus on developing open access archiving tools that can be applied to any discipline with Web content worth keeping

Extensible, modular, easily configured to work with existing technologies that are already in place

Page 4: Digital Preservation and the Open Web: A Curatorial Perspective

Project Stages

Content Identification and Selection
– Key issues for analysis, framework for sample crawls, working with collection partners, exploring extensibility

Content Acquisition
– Content Harvest and Acquisition, configuring of Web Crawler, Analyzer, Content User Interface (CUI), Export/Import Handler

Content Retention
– Data model for Web Archive Digital Objects (WADO) testing and modification, assessing the CDL Digital Preservation Repository for ingest and retention

Partnership Building
– Model Agreements for content retention, evaluate future steps, assess costs of sustaining a distributed approach to Web archiving

Page 5: Digital Preservation and the Open Web: A Curatorial Perspective

Partners in this NDIIPP Grant

Main Partners:
– New York University
– University of North Texas, The Libraries
– Texas Center for Digital Knowledge

Technical Partners:
– UC San Diego Supercomputer Center
– Stanford University Computer Science Department
– Sun Microsystems, Inc.

Page 6: Digital Preservation and the Open Web: A Curatorial Perspective

National Curatorial Partners

Arizona State University Library and Archive

New York University Tamiment Library

University of North Texas, The Libraries

Stanford University Library’s Social Sciences Research Center

Page 7: Digital Preservation and the Open Web: A Curatorial Perspective

University of California Curatorial Partners

UCLA Online Campaign Literature Archive

UC Berkeley Institute of Governmental Studies Library

UC Berkeley Institute of Industrial Relations Library

Eight UC Libraries in the Federal Depository Library Program:
– Berkeley, Davis, Irvine, UCLA, Riverside, San Diego, Santa Barbara, Santa Cruz

Page 11: Digital Preservation and the Open Web: A Curatorial Perspective

The Institute of Industrial Relations: Capturing Labor History in Action

News, data and links are being generated by unions at both the international and local level

Union priorities are necessarily “just in time” and they operate in a state of high triage

Preserving these data is a high priority for IIR and the NYU Tamiment Library

It’s not likely that a non-academic host will do so, making the challenge more urgent

Page 16: Digital Preservation and the Open Web: A Curatorial Perspective

Where Things Stand Now

We’ve got a Wiki and curators are in touch

IIR and NYU/Tamiment are coordinating on labor issues

Technical issues have moved to the fore
– Figuring out the configuration of the crawler, what to crawl

The first crawl report has come back

The results are provocative and interesting

Page 20: Digital Preservation and the Open Web: A Curatorial Perspective

First Crawl Highlights

30 sites crawled, max set to 1 gigabyte
– 18 hit the 1 gigabyte limit

Average files on host: 6,359

Average with Linked hosts included: 17,247

Most files on a single server: 46,197

Median Duration of crawl (host): 7hr 33m

The crawler, Heritrix 1.5.1, returned different data than other crawlers (HTTrack, Wget)
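The headline figures above can be put in proportion with a little arithmetic. The following is only a sketch: the numbers are copied from the slide, while the variable names and the choice of derived ratios are illustrative assumptions, not part of the crawl report.

```python
# Figures copied from the slide; the variable names are illustrative.
sites_crawled = 30
sites_at_cap = 18                  # sites that hit the 1 gigabyte limit
avg_files_host = 6_359             # average files on the host itself
avg_files_with_linked = 17_247     # average once linked hosts are included
median_duration_min = 7 * 60 + 33  # 7hr 33m expressed in minutes

# Share of sites whose crawl was cut off by the size cap.
capped_share = sites_at_cap / sites_crawled

# How much the average file count grows when linked hosts are included.
linked_ratio = avg_files_with_linked / avg_files_host

print(f"{capped_share:.0%} of sites hit the 1 gigabyte cap")
print(f"Linked hosts multiply the average file count by {linked_ratio:.2f}")
print(f"Median crawl duration: {median_duration_min} minutes")
```

Read this way, most of the sampled sites were larger than the cap allowed, so the per-host averages likely understate the true size of those sites.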

Page 21: Digital Preservation and the Open Web: A Curatorial Perspective

Rights and Permissions Vary According to Host

A three-level scheme for future rights management:

Consent Implied: Crawl without permission
– 14 sites in this category

Consent Sought: Crawl but also identify and notify the data owner
– 13 sites in this category

Consent Required: Advance permission needed
– 3 sites in this category
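One way to picture the three-level scheme is as a small lookup that a crawl scheduler could consult before starting a job. This is a minimal sketch under stated assumptions: the names `Consent`, `may_crawl`, and `must_notify` are hypothetical and do not come from the project's tools; only the level descriptions and site counts are taken from the slide.

```python
from enum import Enum


class Consent(Enum):
    """The three rights levels named on the slide."""
    IMPLIED = "crawl without permission"
    SOUGHT = "crawl, but identify and notify the data owner"
    REQUIRED = "advance permission needed"


# Site counts per level, copied from the slide.
SITES_PER_LEVEL = {Consent.IMPLIED: 14, Consent.SOUGHT: 13, Consent.REQUIRED: 3}


def may_crawl(level, permission_granted=False):
    """Return True if a crawl may start under this assumed reading of the scheme."""
    if level is Consent.REQUIRED:
        return permission_granted
    return True


def must_notify(level):
    """Only the 'Consent Sought' level obliges the curator to notify the owner."""
    return level is Consent.SOUGHT


total_sites = sum(SITES_PER_LEVEL.values())
print(total_sites)
```

The three counts add up to the 30 sites in the first crawl, so every sampled site falls into exactly one level.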

Page 22: Digital Preservation and the Open Web: A Curatorial Perspective

Web aRchive Access (WERA)

An open source tool for viewing crawl results

Very new, very much still in development

Relies upon a search query to display the crawled resources

Does not really present how an average user would utilize a finished collection

Page 25: Digital Preservation and the Open Web: A Curatorial Perspective

The Fine Print Matters

Heritrix 1.5.1 doesn’t capture the directory tree of servers; it follows links

Many domains involve multiple servers, and crucial files (such as CSS libraries) need to be captured

The value of capturing linked files varies from site to site, from irrelevant to vitally important
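The difference between following links and copying a directory tree can be shown with a toy breadth-first crawl. This is an illustrative sketch only, not how Heritrix is implemented or configured: the link graph, the `union.example` and `static.example` hosts, and the `crawl` function are all hypothetical.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph: page URL -> URLs that page links to.
LINKS = {
    "http://union.example/": [
        "http://union.example/news.html",
        "http://static.example/site.css",  # stylesheet served from another host
    ],
    "http://union.example/news.html": ["http://union.example/"],
}

# A file that sits on the server but is linked from no page.
ORPHAN = "http://union.example/archive/1999-report.pdf"


def crawl(seed, include_linked_hosts):
    """Follow links breadth-first from the seed and return every URL reached."""
    seed_host = urlparse(seed).hostname
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        for target in LINKS.get(url, []):
            same_host = urlparse(target).hostname == seed_host
            if target not in seen and (same_host or include_linked_hosts):
                seen.add(target)
                queue.append(target)
    return seen


host_only = crawl("http://union.example/", include_linked_hosts=False)
with_linked = crawl("http://union.example/", include_linked_hosts=True)
print(sorted(host_only))
print(sorted(with_linked))
```

In this toy setup the host-only crawl misses the stylesheet on the second server, and neither crawl reaches the unlinked report, which is the kind of gap the slide warns about.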

Page 26: Digital Preservation and the Open Web: A Curatorial Perspective

Curator Perspectives

Need to capture “new publications” as they appear

By a slight majority, monthly intervals are favored for crawl frequency

How much multimedia should be captured? The 1 gigabyte limit obscured the answer

About 70 percent of curators rated the crawl as “mostly effective”

Curators approached the process collaboratively from the very beginning, communicating proactively. This implies that collaborative collection development is viable

Page 27: Digital Preservation and the Open Web: A Curatorial Perspective

What’s Needed

Curators want to see some sort of user interface to evaluate the experience of viewing archived Web resources

The relationship between a particular host and whatever it links to is stimulating debate; probably, both are needed

Long term sustainability of this project will depend on attracting interest from government and industry

Page 28: Digital Preservation and the Open Web: A Curatorial Perspective

Looking Ahead

The Open Access toolkit will be rigorously tested (and will not appear for at least 2 years)

This approach places most responsibility with curators, just as special collection development activity would mandate

This is a new stream of work for information professionals, but the standardization of the toolkit could be an important innovation

Page 29: Digital Preservation and the Open Web: A Curatorial Perspective

Conclusions

The profession-wide culture of collaborative collection development is alive and well, and digesting new digital collection strategies

The combination of a toolkit “deliverable” and the pooled experience of the cohort will be enormously useful for all digital librarians

Hands-on collection experts are in an excellent position to advise technologists in the creation of new digital archiving tools, at the ground level

Page 30: Digital Preservation and the Open Web: A Curatorial Perspective

URLs Referenced

The Web at Risk: http://www.cdlib.org/inside/projects/preservation/webatrisk/

Heritrix Web Site: http://crawler.archive.org/

Web aRchive Access: http://nea.nb.no/

UCLA Campaign Literature Archive: http://digital.library.ucla.edu/campaign

The AFL-CIO: http://www.aflcio.org

Service Employees International Union: http://www.seiu.org

Change to Win: http://www.changetowin.org

The Institute of Industrial Relations Library: http://www.iir.berkeley.edu/library
