Digital Preservation and the Open Web: A Curatorial Perspective
Digital Preservation and the Open Web:
A Curatorial Perspective
Terence K. Huwe
Institute of Industrial Relations
University of California, Berkeley
Computers In Libraries
March, 2006
Overview
A Brief Description of “The Web at Risk” Project
– How it’s organized, who’s involved
Objectives of the Project
– Preservation of the open Web
– Development of an open source “Tool Kit”
How it works, where it’s going, from a “special collections” perspective
The Web at Risk Project
3 year, 2.4 million dollar grant from the Library of Congress National Digital Information Infrastructure and Preservation Program (NDIIPP)
Coordinating Agency: The California Digital Library
Primary focus on developing open access archiving tools that can be applied to any discipline with Web content worth keeping
Extensible, modular, easily configured to work with existing technologies that are already in place
Project Stages
Content Identification and Selection
– Key issues for analysis, framework for sample crawls, working with collection partners, exploring extensibility
Content Acquisition
– Content Harvest and Acquisition, configuring of Web Crawler, Analyzer, Content User Interface (CUI), Export/Import Handler
Content Retention
– Data model for Web Archive Digital Objects (WADO) testing and modification, assessing the CDL Digital Preservation Repository for ingest and retention
Partnership Building
– Model Agreements for content retention, evaluate future steps, assess costs of sustaining a distributed approach to Web archiving
Partners in this NDIIPP Grant
Main Partners:
– New York University
– University of North Texas, The Libraries
– Texas Center for Digital Knowledge
Technical Partners:
– UC San Diego Supercomputer Center
– Stanford University Computer Science Department
– Sun Microsystems, Inc.
National Curatorial Partners
Arizona State University Library and Archive
New York University Tamiment Library
University of North Texas, The Libraries
Stanford University Library’s Social Sciences Research Center
University of California Curatorial Partners
UCLA Online Campaign Literature Archive
UC Berkeley Institute of Governmental Studies Library
UC Berkeley Institute of Industrial Relations Library
Eight UC Libraries in the Federal Depository Library Program:
– Berkeley, Davis, Irvine, UCLA, Riverside, San Diego, Santa Barbara, Santa Cruz
The Institute of Industrial Relations: Capturing Labor History in Action
News, data and links are being generated by unions at both the international and local level
Union priorities are necessarily “just in time” and they operate in a state of high triage
Preserving these data is a high priority for IIR and the NYU Tamiment Library
It’s not likely that a non-academic host will do so, making the challenge more urgent
Where Things Stand Now
We’ve got a Wiki and curators are in touch
IIR and NYU/Tamiment are coordinating on labor issues
Technical issues have moved to the fore
– Figuring out the configuration of the crawler, what to crawl
The first crawl report has come back
The results are provocative and interesting
First Crawl Highlights
30 sites crawled, max set to 1 gigabyte
– 18 hit the 1 gigabyte limit
Average files on host: 6,359
Average with linked hosts included: 17,247
Most files on a single server: 46,197
Median duration of crawl (host): 7hr 33m
The crawler, Heritrix 1.5.1, returned different data than other crawlers (HTTrack, Wget)
Rights and Permissions Vary According to Host
A three level scheme for future rights management:
Consent Implied: Crawl without permission
– 14 sites in this category
Consent Sought: Crawl but also identify and notify the data owner
– 13 sites in this category
Consent Required: Advance permission needed
– 3 sites in this category
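
The three consent levels above could be recorded per site in a small data structure. The following Python sketch is an illustration only, not part of the project's toolkit: the names `Consent`, `may_crawl`, `must_notify` and `site_counts` are hypothetical, and only the three levels and the site counts come from the slide.

```python
from enum import Enum


class Consent(Enum):
    """Hypothetical encoding of the three rights levels from the slide."""
    IMPLIED = "crawl without permission"
    SOUGHT = "crawl, but also identify and notify the data owner"
    REQUIRED = "advance permission needed"


def may_crawl(level, permission_granted=False):
    """Return True if a site at this consent level may be crawled now."""
    if level is Consent.REQUIRED:
        # Only this level blocks the crawl until permission is on file.
        return permission_granted
    return True


def must_notify(level):
    """Return True if the data owner has to be identified and notified."""
    return level is Consent.SOUGHT


# Site counts per category as reported for the first crawl.
site_counts = {Consent.IMPLIED: 14, Consent.SOUGHT: 13, Consent.REQUIRED: 3}
total_sites = sum(site_counts.values())
```

The counts add up to the 30 sites of the first crawl, so every crawled site falls into exactly one category.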
Web aRchive Access (WERA)
An open source tool for viewing crawl results
Very new, very much still in development
Relies upon a search query to display the crawled resources
Does not really present how an average user would utilize a finished collection
The Fine Print Matters
Heritrix 1.5.1 doesn’t capture the directory tree of servers; it follows links
Many domains involve multiple servers, and crucial files (such as CSS libraries) need to be captured
The value of capturing linked files varies from site to site, from irrelevant to vitally important
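
To make the difference between following links and copying a directory tree concrete, here is a minimal Python sketch of a link-following crawl with a size cap. It is a toy model and not how Heritrix is implemented: the `crawl` function, the `pages` dictionary and the example URLs are all hypothetical, and the "web" is an in-memory mapping rather than live HTTP.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(seed, pages, max_bytes, follow_linked_hosts=True):
    """Breadth-first, link-following crawl over a toy in-memory web.

    pages maps URL -> (size_in_bytes, [linked URLs]).
    Returns (captured URLs in crawl order, total bytes captured).
    """
    seed_host = urlparse(seed).netloc
    queue = deque([seed])
    seen = {seed}
    captured, total = [], 0
    while queue:
        url = queue.popleft()
        if url not in pages:
            continue  # dead link: nothing to capture
        size, links = pages[url]
        if total + size > max_bytes:
            break  # size cap reached, comparable to the 1 gigabyte limit
        total += size
        captured.append(url)
        for link in links:
            off_host = urlparse(link).netloc != seed_host
            if link in seen or (off_host and not follow_linked_hosts):
                continue
            seen.add(link)
            queue.append(link)
    return captured, total


# Hypothetical site: the stylesheet lives on a second server, and one file
# sits in the directory tree without any page linking to it.
pages = {
    "http://union.example/": (100, ["http://union.example/news.html",
                                    "http://static.example/site.css"]),
    "http://union.example/news.html": (200, []),
    "http://static.example/site.css": (50, []),
    "http://union.example/orphan.pdf": (500, []),
}

with_linked, _ = crawl("http://union.example/", pages, max_bytes=10_000)
host_only, _ = crawl("http://union.example/", pages, max_bytes=10_000,
                     follow_linked_hosts=False)
```

In this toy run the unlinked `orphan.pdf` is never captured in either mode, and the stylesheet on the second server is captured only when linked hosts are followed.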
Curator Perspectives
Need to capture “new publications” as they appear
By a slight majority, monthly intervals are favored for crawl frequency
How much multimedia should be captured? The 1 gigabyte limit obscured the answer
About 70 percent of curators rated the crawl as “mostly effective”
Curators approached the process collaboratively from the very beginning, communicating proactively. This implies that collaborative collection development is viable
What’s Needed
Curators want to see some sort of user interface to evaluate the experience of viewing archived Web resources
The relationship between a particular host and whatever it links to is stimulating debate; probably, both are needed
Long term sustainability of this project will depend on attracting interest from government and industry
Looking Ahead
The Open Access toolkit will be rigorously tested (and will not appear for at least 2 years)
This approach places most responsibility with curators, just as special collection development activity would mandate
This is a new stream of work for information professionals, but the standardization of the toolkit could be an important innovation
Conclusions
The profession-wide culture of collaborative collection development is alive and well, and digesting new digital collection strategies
The combination of a toolkit “deliverable” and the pooled experience of the cohort will be enormously useful for all digital librarians
Hands-on collection experts are in an excellent position to advise technologists in the creation of new digital archiving tools, at the ground level
URLs Referenced
The Web at Risk: http://www.cdlib.org/inside/projects/preservation/webatrisk/
Heritrix Web Site: http://crawler.archive.org/
Web aRchive Access: http://nea.nb.no/
UCLA Campaign Literature Archive: http://digital.library.ucla.edu/campaign
The AFL-CIO: http://www.aflcio.org
Service Employees International Union: http://www.seiu.org
Change to Win: http://www.changetowin.org
The Institute of Industrial Relations Library: http://www.iir.berkeley.edu/library