WebArchiv - Archive of the Czech Web
5. 6. 2014
WebArchiv
• a digital archive of Czech web resources
• purposes of web archiving:
  • growth of electronic online resources
  • long-term preservation
  • at-risk content on the web
Department of Web Archiving
History
• project started in 2000
• first document harvested on 3. 9. 2001
• IIPC member since 2007
• part of the National Digital Library since 2008
Today
• 87 TB archived data
• whole archive accessible in the library
• only selective harvests accessible online
• more than 4000 archived websites with online access
• 3 people in the department + 1 IT specialist
• focus on long-term preservation
Legal Issues
• Legal deposit act - doesn’t cover online-born documents
• Copyright act - only the library licence, which allows a library to reproduce a work for its own archiving or conservation purposes
• Online access - based on contracts with publishers or on Creative Commons licences
Web Archive Content
1. Comprehensive harvests
2. Selective harvests
3. Topic collections
Comprehensive Harvests
• contract with the Czech domain provider CZ.NIC
• crawl of the whole .cz domain once a year
• accessible only in the library
• a maximum of 5000 harvested files per site
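
The per-site cap above can be illustrated with a small sketch. This is not the crawler's actual configuration, only a minimal model of the idea, and it assumes that a "site" means a single host name. The class and method names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit

MAX_FILES_PER_SITE = 5000  # limit stated on the slide


class SiteBudget:
    """Tracks how many files have been accepted for each host (assumed: site == host)."""

    def __init__(self, limit=MAX_FILES_PER_SITE):
        self.limit = limit
        self.counts = Counter()

    def accept(self, url):
        """Return True and count the URL if its host is still under the limit."""
        host = (urlsplit(url).hostname or "").lower()
        if not host or self.counts[host] >= self.limit:
            return False  # no host found, or the site's budget is used up
        self.counts[host] += 1
        return True
```

Once a host has reached the limit, every further URL on that host is rejected, while other hosts are unaffected.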
Selective Harvests
• selective approach:
  • territory
  • language
  • authorship
  • topic/content
• curated resources
• crawled periodically (several frequencies)
• communication with publishers
• online access
• cataloging
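
Periodic crawling at several frequencies could be scheduled roughly as follows. The slides do not name the frequencies, so the labels and intervals below are assumptions, as are the record fields `url`, `frequency` and `last_crawled`.

```python
from datetime import date, timedelta

# Hypothetical frequency labels; the slides only say "several frequencies".
FREQUENCY_DAYS = {"weekly": 7, "monthly": 30, "yearly": 365}


def is_due(last_crawled, frequency, today):
    """Return True if the resource was never crawled or its interval has elapsed."""
    if last_crawled is None:
        return True
    return today - last_crawled >= timedelta(days=FREQUENCY_DAYS[frequency])


def due_resources(resources, today):
    """Return the URLs of all curated resources that should be crawled today."""
    return [
        r["url"]
        for r in resources
        if is_due(r["last_crawled"], r["frequency"], today)
    ]
```

A curator would keep one record per resource and run `due_resources` before each harvest to build the seed list.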
Topic Collections
• collections of resources related to a certain event or topic
• for example:
  • presidential elections
  • floods
  • Olympic Games
Workflow
• selecting and evaluating
• contracting with publishers
• harvesting
• access and quality assurance
Software
• crawler: Heritrix
• access: OpenWayback
• web curator tool: WA admin
• https://github.com/WebArchivCZ/
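
Wayback-style access tools conventionally address an archived copy by a 14-digit capture timestamp followed by the original URL. The sketch below builds such an address. The base address is a placeholder, not WebArchiv's real endpoint, and the helper name `replay_url` is hypothetical.

```python
from datetime import datetime


def replay_url(base, captured_at, original_url):
    """Build a Wayback-style replay address: <base>/<yyyyMMddHHmmss>/<original URL>."""
    timestamp = captured_at.strftime("%Y%m%d%H%M%S")
    return f"{base.rstrip('/')}/{timestamp}/{original_url}"


# Placeholder base address, assumed for illustration only.
example = replay_url(
    "https://wayback.example.org/wayback/",
    datetime(2001, 9, 3, 12, 0, 0),
    "http://www.example.cz/",
)
```

The exact path prefix depends on how a given installation is deployed, so the base should be taken from that installation's configuration.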
Thank you for your attention.
Barbora Bjačková [email protected]
Jaroslav Kvasnica [email protected]
http://www.webarchiv.cz