How to Face the Challenges of Web Archiving?
The experiences of a small library on the edge.
Chloe Martin, Internet Memory
Catherine Ryan, National Library of Ireland
LIBER 2012 - 1
Context: National Library of Ireland
• Beginnings: Established by the Dublin Science and Art Museum Act, 1877
• Mission: “to collect, preserve, promote and make accessible the documentary and intellectual record of the life of Ireland”.
• The Digital Record: Born Digital Programme established in 2010, covering web archiving.
• Web Archive Projects: 2 pilot projects in 2011
Context: Internet Memory
European Archive / Internet Memory Foundation
• Established in 2004 in Amsterdam (offices also in Paris)
• Mission: to preserve Web content as a new medium for current and future generations
• Actions: Awareness raising, partnerships, R&D
• Open Access Collections: UK National Archives & Parliament, PRONI, CERN and The National Library of Ireland
Internet Memory Research
• Spin-off of IM established in June 2011 in Paris
• Missions: to operate large-scale or selective crawls and develop new technologies (crawl, access, processing and extraction)
Web Archiving Project: Project Origins National Library of Ireland
Building a 21st Century Library:
– Born Digital
– Digitisation
– Single Integrated Catalogue
– Digital Repository
– OSCAIL, the Digital Library Programme
Web Archiving Project: Project Origins National Library of Ireland
Born Digital Materials:
• Natural progression for NLI’s strong political, cultural and historical collections
• How best to approach this in a time of unprecedented financial difficulty?
• Born Digital Programme established to examine requirements and produce a policy document for the next steps
Web Archiving Project: Project Origins National Library of Ireland
The Hand of History:
– Snap General Election
– Five Weeks
Web Archiving Project: Project Origins National Library of Ireland
Just do it
How?
Web Archiving Project: Project Origins National Library of Ireland
Collaborative Partnership:
Partner that suited our requirements and that had experience with others in the cultural sector
Requirements:
– Technical skills: existed in the NLI but were committed to other projects, so we needed to source them from a partner
– Leverage NLI’s own strong curatorial experience, esp. in politics
– Fast!
Web Archiving Project: Project Origins National Library of Ireland
Project phases:
– Project scoping and contract
– Site selection
– Permissions gathering
– QA (look and feel)
– Publication and promotion
Site Selection and Permissions: National Library of Ireland
Selection Criteria:
– Website presence
– Technical reasons
– Cut-off date
– Women candidates
Permissions:
– All sites contacted and provided with a brief
– Pressurised but necessary phase
Scope of projects: National Library of Ireland
General Election:
– Crawl: 200 snapshots
– Scope: 100 seeds
– Frequency: 2 times
– Date: Feb. 2011
Presidential Election:
– Crawl: 80 snapshots
– Scope: 70 seeds
– Frequency: 3 times
– Date: Oct-Nov. 2011
Crawl: Internet Memory
• Seeds Validation: URLs, Duplication, Redirection, External links, Dynamic websites
• Scope Parameters: Domain, host and path; Social Web content; Frequency; Robots.txt exclusions; Politeness
• Specific incidents and technical changes on the fly: Modification of scope; Pending crawls; Adaptation of the politeness settings
• Improvement of second crawl
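The seed validation and scope checks listed on this slide can be illustrated with a minimal Python sketch. It is not Internet Memory's actual tooling: the function names (`normalize_seed`, `validate_seeds`, `in_scope`) and the example URLs are hypothetical, and it covers only malformed URLs, duplicates and host/path scope. Redirections, robots.txt handling and politeness delays are left out.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_seed(url):
    """Lower-case the host, drop any fragment, and default an empty path to '/'."""
    parts = urlsplit(url.strip())
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"not a crawlable URL: {url!r}")
    path = parts.path or "/"
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))


def validate_seeds(urls):
    """Return (valid, rejected); valid seeds are normalized and de-duplicated in order."""
    seen, valid, rejected = set(), [], []
    for url in urls:
        try:
            seed = normalize_seed(url)
        except ValueError:
            rejected.append(url)  # malformed entry, to be checked by hand
            continue
        if seed not in seen:
            seen.add(seed)
            valid.append(seed)
    return valid, rejected


def in_scope(url, seed):
    """True if url is on the seed's host and under the seed's path prefix."""
    u, s = urlsplit(url), urlsplit(seed)
    same_host = u.netloc.lower() == s.netloc.lower()
    return same_host and (u.path or "/").startswith(s.path or "/")


if __name__ == "__main__":
    # Hypothetical seed list: one duplicate and one entry without a scheme.
    seeds = [
        "http://Example.ie",
        "http://example.ie/",
        "example.ie/about",
        "https://party.example/campaign/",
    ]
    valid, rejected = validate_seeds(seeds)
    print("valid:", valid)
    print("rejected:", rejected)
    print(in_scope("https://party.example/campaign/events", valid[1]))
```

A path-prefix rule like this is the simplest way to express "domain, host and path" scope; a real crawl would typically layer further rules on top for social Web content and external links.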
Quality Assurance (QA): National Library of Ireland
• Manual QA
• Jira software
• IM – Technical QA
• NLI - ‘Look and Feel’ QA
• Multiple browsers
• Communication with site owners (building relationships and promotion)
Quality Assurance (QA): Internet Memory
• Why?
• How?
  – Manual and visual method: homepage + 2
  – Resolution of issues
• Temporal Coherence
Access: National Library of Ireland
• Available to the public
• Full text search
• IM website – search by keyword, URL
• NLI catalogue – keyword via widget developed by NLI IS team and IM
• Future – access through NLI’s own interfaces, issue of integrating results
Publication and Promotion: National Library of Ireland
• NLI social media initiative (Twitter and blog)
• Project participants
• Print media (esp. in area of technology)
• And IM!
• Usage figures have increased but real value more apparent in 5-10 years
Usage Statistics of Web Archive: National Library of Ireland
21/09/2011: Official launch of NLI Web archives (Tweets)
26/10/2011: Blog post on nli.ie/blog and article on thejournal.ie
25/11/2011: Article on irishtimes.com
20/01/2012: Article on irishtimes.com
17/03/2012: Post on soundofthearchives.wordpress.com
04/05/2012: Article on irisheconomy.ie
Advantages of Web Archiving: National Library of Ireland
Web archiving:
– New opportunities for delivery of materials to users
– Work with existing users’ expectations that content be online
– Reach new audiences
Advantages of Web Archiving: National Library of Ireland
Political web archives; Irish General Election:
– Researchers can compare online content pre- and post-election
– Facilitates research into how ‘online’ this election was
– Assess impact of technological developments in campaign communications
– Record of campaign information
Benefits of Working Together: National Library of Ireland
Pilot project for a long-term activity:
– Allowed us to enter a new collecting area despite lack of tech expertise
– Facilitated collection of important material that no one else was collecting
– Collect material quickly
– Leverage curatorial skills
– Gained new technical skills
Benefits of Working Together: Internet Memory
• To support the development of Web archiving initiatives
• To operate rapid deployment of Web archives
• To address new challenges in this area:
  – Social media content
  – QA
  – Automation
Conclusion
General Election:
• 18,495,771 URLs
• 1.14 TB
• 10,405 ARCs
Presidential Election:
• 7,333,399 URLs
• 278.10 GB
• 2,513 ARCs
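To put these totals in proportion, the short Python sketch below derives the average size per captured URL and per ARC file for each crawl. It assumes decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes), which the slides do not specify, and the helper name `averages` is illustrative only. Under that assumption both crawls come out at roughly 110 MB per ARC file.

```python
def averages(total_bytes, urls, arcs):
    """Return (average bytes per URL, average bytes per ARC file)."""
    return total_bytes / urls, total_bytes / arcs


# Figures from the slide; decimal units are an assumption.
GENERAL = averages(1.14e12, 18_495_771, 10_405)
PRESIDENTIAL = averages(278.10e9, 7_333_399, 2_513)

if __name__ == "__main__":
    for name, (per_url, per_arc) in [
        ("General Election", GENERAL),
        ("Presidential Election", PRESIDENTIAL),
    ]:
        print(f"{name}: {per_url / 1e3:.1f} KB per URL, {per_arc / 1e6:.1f} MB per ARC")
```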
View the NLI collections at: http://www.nli.ie/en/udlist/digital-collections.aspx
View the Web archive blog entry at: http://www.nli.ie/blog/index.php/2011/10/26/general-election-2011-web-archiving/
View Internet Memory Collections at: http://collections.europarchive.org/
To be continued…
Questions?
Thanks for your attention!
Chloe Martin
Internet Memory
http://internetmemory.org
[email protected]
@InternetMemory
Catherine Ryan
National Library of Ireland
http://[email protected]
@NLIreland