http://www.webarchiv.cz WebArchiv Czech Web Archive IIPC 2007, Paris




WebArchiv – overview

The Czech WebArchiv was originally funded by the Ministry of Culture and launched in 2000.

Since then the project has been implemented by the National Library in cooperation with the Moravian Library and the Institute of Computer Science of Masaryk University.

Both large-scale automated harvesting of the entire Czech national web and selective archiving are being carried out, including thematic, event-based collections (using Heritrix).

Due to copyright law, the complete archive can be accessed only on-site, from within the library (using Wayback).

Archived resources which are covered by a written agreement with their publisher are accessible online using WERA.
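The two access rules above can be summarised in a short sketch. This is only an illustration of the policy as described on the slide, not WebArchiv's actual implementation; the `ArchivedResource` class, its fields, and the `allowed_interfaces` function are hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchivedResource:
    """Hypothetical record for one archived resource."""
    url: str
    publisher_agreement: bool  # True if a written agreement with the publisher exists


def allowed_interfaces(resource: ArchivedResource, on_site: bool) -> list[str]:
    """Return the access interfaces a user may use for this resource.

    Assumed policy, taken from the overview slide:
    - on-site users may see every file via Wayback;
    - resources covered by a publisher agreement are also online via WERA.
    """
    interfaces = []
    if on_site:
        interfaces.append("Wayback")
    if resource.publisher_agreement:
        interfaces.append("WERA")
    return interfaces


if __name__ == "__main__":
    covered = ArchivedResource("http://example.cz/", publisher_agreement=True)
    uncovered = ArchivedResource("http://example.org/", publisher_agreement=False)
    print(allowed_interfaces(covered, on_site=False))
    print(allowed_interfaces(uncovered, on_site=False))
```

Under this reading, a remote user gets an empty list for any resource that has no publisher agreement.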


WebArchiv – Workflows

Prague:
- Resource selection
- Cataloguing for the National Bibliography (MARC21)
- Providing Dublin Core metadata for interested publishers
- Making archive access agreements with publishers

Brno:
- Running WebArchiv hardware
- Software localization, maintenance and development
- Pre-harvesting resource analysis
- Harvesting, indexing, access

Results so far:
- 4 harvesting rounds of the .cz domain (2001, 2002, 2004, 2006)
- 5 event-oriented harvests
- Harvests of sites under agreements, several times per year
- 5.4 TB archive with 136 million files
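As a rough sanity check on the last figure, the average size of an archived file follows from simple division. The sketch below assumes decimal terabytes (1 TB = 10^12 bytes), which the slide does not specify; the function name is hypothetical.

```python
def average_file_size_bytes(archive_bytes: float, file_count: int) -> float:
    """Average size of one archived file in bytes."""
    return archive_bytes / file_count


if __name__ == "__main__":
    archive_bytes = 5.4e12       # 5.4 TB, assuming decimal terabytes
    file_count = 136_000_000     # 136 million files
    avg = average_file_size_bytes(archive_bytes, file_count)
    print(f"{avg / 1000:.1f} kB per file on average")
```

With these assumptions the average comes out at roughly 40 kB per file.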


WebArchiv – Tools

Software tools:
- Web-based Dublin Core metadata creator
- National Bibliography Number (NBN) generator
- Heritrix crawler
- NutchWAX, WERA: full-text indexing and public archive access
- wa-cz: locally developed infrastructure
- WayBack: Wayback Machine-like interface for the whole archive, limited access

Hardware:
- 3 HP ProLiant servers, 5.8 TB SATA disk array
- Awaiting transfer of the archive files to the National Library's central storage facility (25+ TB, mirrored, FC+SATA) later this year
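To give an idea of what a Dublin Core metadata creator produces, here is a minimal sketch that serialises a few Dublin Core elements to XML. It is not the WebArchiv tool itself. The `build_dc_record` function, the `metadata` wrapper element, and the sample values are assumptions made for the illustration; only the element names and the namespace URI come from the Dublin Core Metadata Element Set.

```python
import xml.etree.ElementTree as ET

# Namespace of the 15-element Dublin Core Metadata Element Set.
DC_NS = "http://purl.org/dc/elements/1.1/"

DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def build_dc_record(fields: dict[str, str]) -> str:
    """Serialise a mapping of Dublin Core element names to an XML string."""
    unknown = set(fields) - DC_ELEMENTS
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")

    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Sample values are illustrative only.
    print(build_dc_record({
        "title": "WebArchiv",
        "identifier": "http://www.webarchiv.cz",
        "language": "cs",
    }))
```

Rejecting unknown element names keeps publisher-supplied records within the basic element set. A real tool would probably also need qualified terms and repeated elements.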


WebArchiv – Infrastructure

A1 new crawl; A2 end crawl -> index; A3 update fulltext; A4 update host list


WebArchiv – Future Work

- Workflow management application
- Harvesting of Bohemica (Czech-related resources) outside the .cz domain
  - language analysis
  - feedback from Heritrix about dropped URLs from the .cz crawl
- Adaptive incremental harvesting, incremental indexing
- Selective harvesting on demand
- Fulltext indexing of the whole archive
- Identification of similar documents
- Permanent linking into the archive (permanent ID)
- Integration of the archive into the planned National Digital Library (selection of software 2008)
- Long-term preservation (via NDL system)
- Implementation of digital library standards: OAI-PMH, METS, SRU/SRW
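The language analysis planned for finding Czech content outside the .cz domain could start from something as simple as counting Czech diacritics. The sketch below is a deliberately crude heuristic, not the project's method. The function names and the 5% threshold are assumptions, and several of the marker letters also occur in other languages (Slovak, for instance), so a production system would need a proper language identifier.

```python
# Letters with diacritics used in Czech orthography (lower and upper case).
CZECH_MARKERS = set("ěščřžýáíéůúťďňĚŠČŘŽÝÁÍÉŮÚŤĎŇ")


def czech_score(text: str) -> float:
    """Fraction of alphabetic characters that are Czech diacritic letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch in CZECH_MARKERS for ch in letters) / len(letters)


def looks_czech(text: str, threshold: float = 0.05) -> bool:
    """Guess whether a page is Czech; the threshold is an arbitrary assumption."""
    return czech_score(text) >= threshold


if __name__ == "__main__":
    print(looks_czech("Národní knihovna České republiky"))
    print(looks_czech("National Library of the Czech Republic"))
```

A page that scores above the threshold could then be queued as a candidate seed for a crawl outside .cz, subject to curator review.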


Archive daily ingest

[Chart: number of files ingested per day (y-axis from 0 to 3,000,000) over the period 1.9.2001 to 1.3.2007. Labelled harvests: cz2001, cz2002, cz2004, cz2005, cz2006, agreements. Harvesters: NEDLIB harvester, Heritrix.]


People

- Librarians, project management: National Library, 3.5 FTE
- IT management: Moravian Library, 1 part-time
- IT: Masaryk University, 6 part-time