Page 1: Web Harvesting Collaborations at Library and Archives Canada Tom Smyth Manager, Digital Capacity tom.smyth@bac-lac.gc.ca IIPC GA 2014

Web Harvesting Collaborations at Library and Archives Canada

Tom Smyth
Manager, Digital Capacity
tom.smyth@bac-lac.gc.ca

IIPC GA 2014


Operating Context


LAC’s Legislative Context: Mandate

Library and Archives of Canada Act (S.C. 2004, c. 11)

Preamble

WHEREAS it is necessary that

(a) the documentary heritage of Canada be preserved for the benefit of present and future generations;

(b) Canada be served by an institution that is a source of enduring knowledge accessible to all, contributing to the cultural, social and economic advancement of Canada as a free and democratic society;

(c) that institution facilitate in Canada cooperation among the communities involved in the acquisition, preservation and diffusion of knowledge; and

(d) that institution serve as the continuing memory of the government of Canada and its institutions;


Legislative Context: Authorities for Collection

Multiple authorities exist within the LAC Act that empower it to collect government information:

• Section 10: Legal Deposit
– Covers all publications that are published in Canada, including those from the GC.

• Sections 12 & 13: Government and Ministerial Records
– Covers the disposition, transfer, and right of access to government records.


Legislative Context: Authorities for Collection (2)

Library and Archives of Canada Act

OBJECTS AND POWERS

Sampling from Internet

Section 8. (2):

“In exercising the powers referred to in paragraph (1)(a) and for the purpose of preservation, the Librarian and Archivist may take, at the times and in the manner that he or she considers appropriate, a representative sample of the documentary material of interest to Canada that is accessible to the public without restriction through the Internet or any similar medium”.


LAC’s Web Harvesting: Background

• LAC began harvesting with the Government of Canada web presence in 2005
– Hybrid library and archival methodological context
– In total, LAC has collected it four times (2005, 2006, 2007, 2013)

• LAC makes this harvested data openly accessible via the Government of Canada Web Archive (GCWA, 2006)

• Thematic material has been collected since 2005, but not according to a disciplined (library) collection development methodology until about 2009


Government of Canada Web Archive


WebCan Crawl Management Tool

• Developed internally by LAC to allow acquisition staff to manage all operations: harvest definitions, seed lists, crawl management, quality assurance


Collection Overview

• LAC’s web archival collection consists of:
– Four comprehensive crawls of the Canadian federal government web presence (*.gc.ca, some *.ca)
  • (2005, 2006, 2007, 2013)
– Decommissioned federal websites emergency-harvested between domain crawl periods
  • ~170 major departmental websites since 2009
– Thematic collections from the open Canadian web
  • ~15, built with increasingly complex collaborative relationships
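
The domain-crawl scope noted above (*.gc.ca, plus some *.ca hosts) can be pictured as a simple host filter. The following is only a sketch under stated assumptions: the function name, the allow-list, and the example host are hypothetical and do not describe WebCan or LAC's actual crawl configuration.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list standing in for the "some *.ca" hosts;
# this is NOT LAC's real list.
EXTRA_CA_HOSTS = {"example-agency.ca"}


def in_gc_scope(url, extra_hosts=EXTRA_CA_HOSTS):
    """Return True if the URL's host is under *.gc.ca or an allow-listed .ca host."""
    host = (urlsplit(url).hostname or "").lower()
    # Match gc.ca itself and any subdomain, but not look-alikes such as "notgc.ca".
    if host == "gc.ca" or host.endswith(".gc.ca"):
        return True
    # Apply the same "exact host or subdomain" rule to each allow-listed host.
    return any(host == h or host.endswith("." + h) for h in extra_hosts)
```

Matching on the parsed hostname rather than on the raw URL string avoids accepting hosts that merely contain "gc.ca" somewhere in their name or path.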

Recent Collaborative Projects


Models of Collaboration

• Internal collaboration
– Librarians and Archivists at LAC working together to curate seedlists and web collections
– 2005-2009

• Interdepartmental collaboration
– Librarians, Archivists, PMs, Olympics Specialists
– Olympic and Paralympic Web Archive
– 2009-

• Internal, Interdepartmental, Stakeholders
– Librarians, Archivists, Policy, Government Administration and Technical Specialists from Central Agencies, and Govt Docs and Data Specialists
– 2013-

Olympic and Paralympic Web Archive


Canadian Olympics Web Archive

• LAC started curating web collections to document Canadian participation in the Olympic and Paralympic games shortly after the programme launched in 2005.

• LAC has curated web archival collections for each of the following games:
– Turin, Winter 2006
– Beijing, Summer 2008
– Vancouver, Winter 2010
– London, Summer 2012
– Sochi, Winter 2014

Library and Archives Canada

Vancouver 2010 Olympic and Paralympic Games Activities


Olympics WA Key Questions

• Who will be consulting our web archives?

• Why will they be consulting our web archives?

• How will they use the web archives?

– Data and text mining, looking for specific resources, social and cultural context

• What sort of resources warrant entry to a web archive?

• How can the web archive become a robust and multidisciplinary research tool?


Vancouver 2010 Web Archival Project

• LAC entered into a partnership with the Department of Canadian Heritage to build a web archive documenting the uniquely Canadian Vancouver 2010 games.
– Selected based on the statistical needs of PCH (largely tourism)
– Curated from the outset to cover broad social, cultural, economic, infrastructural, and academic topics for maximum data and research applicability.
– Nine iterative crawl jobs took place to capture 350+ websites (many 2-3 times each), comprising ~2 TB of data.


Vancouver 2010 Olympic Games

• Seedlist selected by Canadian Heritage specialists, representatives from the Federal Secretariat for the Vancouver Games, and LAC librarians and archivists

• Olympics Studies methodologies were considered and factored into the curation of the collection (Olympics impact, infrastructure, sports medicine, sponsorship, coaching, etc.)

• Methodology created to assess target websites for currency, authority, perspective, and frequency of content generation
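
The four assessment criteria named above (currency, authority, perspective, frequency of content generation) can be pictured as a simple scoring rubric. The sketch below is illustrative only: the slides do not describe how the criteria were scored or weighted, so the 0-3 scale, the equal weighting, the threshold, and all names are assumptions.

```python
from dataclasses import dataclass

# Criteria taken from the slide; the 0-3 scale used below is an assumption.
CRITERIA = ("currency", "authority", "perspective", "frequency")


@dataclass
class SeedAssessment:
    """One curator's assessment of a candidate seed URL (hypothetical structure)."""
    url: str
    currency: int      # how current the content is
    authority: int     # how authoritative the publisher is
    perspective: int   # how distinct a viewpoint the site adds to the collection
    frequency: int     # how often new content is generated

    def score(self):
        # Equal weighting is an assumption, not the documented methodology.
        return sum(getattr(self, name) for name in CRITERIA)


def select_seeds(assessments, threshold=8):
    """Return URLs scoring at or above the (assumed) threshold, best first."""
    ranked = sorted(assessments, key=lambda a: a.score(), reverse=True)
    return [a.url for a in ranked if a.score() >= threshold]
```

Recording each criterion separately, rather than a single overall rating, keeps the reason for a selection decision visible when the seedlist is reviewed by several collaborators.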


Vancouver 2010 Data Set

• Canadian Heritage was interested in the production of a large data set that could be mined and analyzed primarily for cultural, social, and tourism information

• Selection methodology based in part on the target resource’s ability to contribute valuable information to a robust, minable dataset

• This methodology has informed every project since


Vancouver 2010 Data Set

• Web collection curated to include the following:

– Aboriginal and First Nations Perspectives

– Environmental impact perspectives

– Economic and infrastructural development and impact in Vancouver

– Public Policy and Think Tank perspectives

– Pro/Con perspectives in the media on hosting an Olympics

– A complete record of all the social and cultural events that ran during the games, including all the official sites reporting the day-to-day events and the results

– Tourism, Sponsorship, Own the Podium campaign, Torch Relay, etc.

– Subject matter of interest to Olympics and Sports Studies specialists

Library and Archives Canada

Vancouver 2010 Olympic and Paralympic Games Activities


Political, Social, Cultural, Historical

Thematic Web Collections


Thematic Collections Overview

• Federal and Provincial Elections
• Royal Commissions and Commissions of Inquiry
• Canada’s Participation in Olympic and Paralympic Games
• State Funerals
• Transitions in Federal Organizations
• Decommissioned Federal Organization/Websites
• Visits to Canada from British Royals
• Change of Governors General
• 100th Anniversary of the Calgary Stampede
• Commemoration of the War of 1812


Thematic Collections: Context

• Starting in January 2013, LAC began curating major thematic collections on political, social, cultural, commemorative, and historical topics

• Olympics Web Archive curation and project methodology influenced the thematic projects

• In 2013, one project was conducted per FY quarter:
– Q1: The “Idle No More” Aboriginal movement in Canada
– Q2: Development of the Keystone Oil Pipeline
– Q3: Canadian perspectives on the Arctic
– Q4: The Lac-Mégantic Rail Disaster


Internal Collaboration

• Thematic project topics originated in LAC’s Strategic Research and Policy area

• Each thematic project drew on the expertise of the relevant library and archival subject matter specialists to scope project parameters and develop a collaborative seedlist
– Political
– Historical
– Social & Cultural
– Economic
– Specialized Media


Government of Canada

Official Publications and Websites


Context

• The Treasury Board Secretariat of Canada’s Web Renewal Action Plan
– Consolidates the GC’s Web presence from ~1,500 websites down to one, Canada.ca

• Stakeholders from the GC and the public expressed concern that web resources of enduring value would be lost

• Key stakeholders mobilized to engage LAC and lobby for collaborative web archiving activity



Collaboration

• Two major stakeholders, the Universities of Alberta and Toronto, run their own harvesting programmes via Archive-It

• Collaborative work proceeded immediately to identify and prioritize some 3,000 government websites for harvesting by LAC

• Stakeholder expectations, advice, and extant seedlists factored directly into the methodology and curation of LAC’s 4th domain crawl project

• Began in September 2013, and is currently in QA


LAC’s Web Harvesting: Current Status

• LAC began a 4th crawl of the Government of Canada web domain in Sept 2013:
– Official Languages Act; TBS Directive on Web Accessibility
– Data collection outsourced to Internet Archive’s “Archive-It” service
– Data will be returned to LAC and made accessible via an upgraded GCWA


2013-14 GC Domain Harvest: Preliminary Results

• GC websites successfully captured as of May 17th 2014:
– 760+ major departmental websites

• QA still ahead of us


Key Issues: Curation

• With capacity for addressing only a handful of thematic topics, which ones get selected for curation?
– Which issues count the most, and according to whose perspective?
– Selective vs. comprehensive, finite vs. ongoing

• As more thematic projects are undertaken, sustainability and capacity issues arise
– For themes that remain pertinent, conducting update crawls in order to update the archive and maintain its currency

• Securing long-term buy-in and resources for the continuing support and development of the web archiving technical infrastructure
– E.g., further development for WebCan

• How much time to put into QC?
– How much QA is “good enough?”


Conclusion (Key Answers)

• We’ve adopted a researcher-centric approach to the construction of thematic collections

• We’ve adopted a govtdocs subject-specialist approach to the collection of the federal government domain and its official publications:
– LAC has web archival holdings of the Electronic Depository Services Program Checklists as of 1995:
  • http://epe.lac-bac.gc.ca/100/201/301/liste_hebdomadaire/
  • http://epe.lac-bac.gc.ca/100/201/301/weekly_checklist/

• Web analytics demonstrate extensive use of the GCWA by Canadian universities, private industry, and provincial, federal, and international governments


Web Archives as Data Set

• LAC’s web archival holdings as:
– Open Data
  • Assembling web archives with as wide a perspective as possible, with an eye to making them Open Data?
  • Potential for addition to the GC Open Data Portal
  • Several requests already for the CGI-PLN LOCKSS
– Big Data
  • High potential for governmental, science, policy, financial data and textual mining; has IM applications
  • Potential impetus for governmental innovation in information and services to the Canadian public


Next Steps

• LAC is currently defining its long-term business strategy and technical requirements for a renewed Web harvesting program for FY 2014-2015

• The GCWA will be updated to provide access to all of LAC’s web archival holdings (~20 TB)
– Migration of legacy ARCs to ISO standard WARCs
– GCWA will be migrated to a WCAG-compliant GC look and feel

• Construct researcher-centric discovery and search tools


Web Harvesting Team @ LAC

Tom Smyth
Manager, Digital Capacity
tom.smyth@bac-lac.gc.ca

Patricia Klambauer
Lead Web Harvesting [email protected]

Strategic Initiatives and Client Relations Division

Evaluation and Acquisitions Branch

Library and Archives Canada
