WebArchiv - Archive of the Czech Web

Upload: jaroslav-kvasnica

Post on 12-Jul-2015

Page 1: WebArchiv - Archive of the Czech Web

WebArchiv - Archive of the Czech Web

5. 6. 2014

WebArchiv

• a digital archive of Czech web resources

• purposes of web archiving:

  • growth of electronic online resources
  • long-term preservation
  • at-risk content on the web

Department of Web Archiving

History

• project started in 2000

• first document harvested on 3. 9. 2001

• IIPC member since 2007

• since 2008 part of National Digital Library

Today

• 87 TB archived data

• whole archive accessible in the library

• only selective harvests accessible online

• more than 4000 archived websites with online access

• 3 people in the department + 1 IT guy

• focus on long-term preservation

Legal Issues

• Legal deposit act - doesn’t cover online-born documents

• Copyright act - only the library licence, which allows a library to reproduce a work for its own archiving or conservation purposes

• Online access - based on contracts with publishers or on Creative Commons licences

Web Archive Content

1. Comprehensive harvests

2. Selective harvests

3. Topic collections

Comprehensive harvests

• contract with CZ.NIC, the Czech domain registry

• once a year crawl of the whole .cz domain

• accessible only in the library

• a maximum of 5000 harvested files per site
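
The per-site cap above can be illustrated with a small sketch. This is not the crawler configuration used by WebArchiv; it only shows the idea of a file budget per site. The `SiteBudget` class is hypothetical, and treating a "site" as a single hostname is an assumption.

```python
from collections import Counter
from urllib.parse import urlsplit

MAX_FILES_PER_SITE = 5000  # the cap stated on the slide


class SiteBudget:
    """Hypothetical helper: counts accepted files per hostname."""

    def __init__(self, limit=MAX_FILES_PER_SITE):
        self.limit = limit
        self.counts = Counter()

    def accept(self, url):
        """Return True and count the URL if its host is still under the cap."""
        # Assumption: one hostname equals one "site".
        host = (urlsplit(url).hostname or "").lower()
        if self.counts[host] >= self.limit:
            return False
        self.counts[host] += 1
        return True
```

A crawl loop would call `budget.accept(url)` before fetching and skip the URL once the host has used up its budget.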

Selective Harvests

• selective approach:
  • territory
  • language
  • authorship
  • topic/content

• curated resources

• crawled periodically (several frequencies)

• communication with publishers

• online access

• cataloging
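
Periodic crawling at several frequencies could be sketched as follows. The slide does not name the frequencies, so the labels and day counts in `FREQUENCY_DAYS` are assumptions, as are the seed record fields and the function names.

```python
from datetime import date, timedelta

# Assumed frequency labels; the slide only says "several frequencies".
FREQUENCY_DAYS = {"monthly": 30, "quarterly": 91, "yearly": 365}


def is_due(last_harvest, frequency, today):
    """Return True if the configured interval has elapsed since the last harvest."""
    return today - last_harvest >= timedelta(days=FREQUENCY_DAYS[frequency])


def due_seeds(seeds, today):
    """Return the URLs of all seeds that should be crawled again today."""
    return [
        seed["url"]
        for seed in seeds
        if is_due(seed["last_harvest"], seed["frequency"], today)
    ]
```

Each seed here is a plain dictionary with `url`, `frequency`, and `last_harvest` keys; a real curator tool would keep this in its database.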

Topic Collections

• collection of resources related to a certain event or topic

• for example:
  • presidential elections
  • floods
  • Olympic Games

Workflow

• selecting and evaluating

• contracting with publishers

• harvesting

• access and quality assurance

Software

• crawler: Heritrix

• access: OpenWayback

• web curator tool: WA admin

• https://github.com/WebArchivCZ/
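
Wayback-style access tools conventionally address an archived copy by a 14-digit capture timestamp followed by the original URL. A minimal sketch of building such a replay address is shown below; the base address in the usage note is a placeholder, not WebArchiv's actual endpoint, and `wayback_url` is a hypothetical helper name.

```python
from datetime import datetime


def wayback_url(base, original_url, captured_at):
    """Build a replay address of the form <base>/<YYYYMMDDhhmmss>/<original URL>."""
    timestamp = captured_at.strftime("%Y%m%d%H%M%S")
    return f"{base.rstrip('/')}/{timestamp}/{original_url}"
```

For example, `wayback_url("http://wayback.example.org/wayback", "http://www.example.cz/", datetime(2001, 9, 3, 12, 0, 0))` yields an address with the timestamp `20010903120000` placed between the base and the original URL.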

Thank you for your attention.

Barbora Bjačková [email protected]

Jaroslav Kvasnica [email protected]

http://www.webarchiv.cz