Annotating Web archives - structure, provenance, and context through archival cataloguing

<ul><li><p>8/2/2019 Annotating Web archives - structure, provenance, and context through archival cataloguing</p><p>Technical Note</p><p>Annotating Web archives - structure, provenance, and context through archival cataloguing</p><p>P. H. J. WU*, A. K. H. HEOK and I. P. TAMSIR</p><p>Nanyang Technological University, 31 Nanyang Link, Singapore 637718</p><p>Despite the success of Internet access via search technology, such ease of access is still not available in Web archives, as a greater amount of relevant contextual information is essential in accessing Web archives. The degree of relevance of the contextual information has to be customized to suit research on culture and heritage study over time. Information scientists have long been struggling to find a system that can help them organize Web archives so that users can have access to complete and coherent collections. Lessons can be learned from archivists, who have an established tradition of linking materials to their origin and ownership, or what is termed provenance. In this paper, we demonstrate how Web Annotation for Web Intelligence, more than just an intuitive way of expressing one's thoughts on the materials under study, is in fact an appropriate tool for cataloguing Web archives in order to ensure a high quality of access for users. A demonstration of access to archived Web materials, informed by the theory of Records Continuum, will be presented. We then recommend an effective way of allowing the continual organization of Web archives based on several design principles for a Web annotation system. This system would preserve the evidence and context of the cataloguing process. Such a tool would also help facilitate collaboration among information professionals in organizing complex Web archives. Implementing the recommended Web annotation system will help ensure better-quality archives with more evidence and contextual information preserved within the system.</p><p>1. 
Introduction</p><p>Web users are accustomed to instant access to information with the success of Web search technology. However, due to the different versions of websites kept in a Web archive, a greater effort to catalogue materials in an archive is needed to accommodate the need for easy access that Web users expect. There has been increasing interest in providing a more complex information architecture for leverage, such as taxonomy, metadata, ontology, and the integration of different modes of access, including searching, browsing, and routing. This paper examines a particular case for accessing Web archives, which contain complex materials that can serve distinct communities, including social scientists and historians. We present a perspective in which websites are more than mere publications. They should be seen as evidence of the cultural activities of contemporary society.</p><p>*Corresponding author. Email:</p><p>New Review of Hypermedia and Multimedia, Vol. 13, No. 1, July 2007, 55-75</p></li><li><p>As such, their collection should be managed differently, as an archive would manage its holdings, preserving the contextual evidence of its content. In a previous paper (Wu et al. 2006), we demonstrated a bibliographic approach to cataloguing Web archives and showed how metadata produced by Web annotation can serve as points of access to Web archives. In that paper, a short survey of the various library Web archive models around the world also pointed to a pressing inadequacy in the available methods of organizing their materials. These usually employ bibliocentric cataloguing that treats each website as an entity without any relationship to the other materials in the collection. 
This is because the contextual and provenancial information of these collections, which is essential for social scientists and historians to understand, is not made apparent, with much of the information being buried deep within the archives. A more suitable model being developed is the Arizona Model (Pearce-Moses and Kaczmarek 2005), where archival principles of provenance and original order are adopted. This approach may prove more useful in presenting a Web archive's holdings to facilitate knowledge discovery. The technological challenge then becomes one of how Web annotation can be effectively extended to help organize contextual and provenancial relationships based on bibliographic metadata. We explain the need for these requirements with a concrete case in Section 2, from a post-custodian approach.</p><p>In Section 3, a context-aware Web annotation system, termed the Web Annotation for Web Intelligence (or WAWI), is introduced. The WAWI Web annotation system ensures the capture of evidence and contextual information for the Web archive catalogue. WAWI is part of a joint project between the National Library Board of Singapore and Nanyang Technological University to catalogue and archive Singapore websites. Before explaining how context-aware annotation works, we will review the difference between context-less and context-aware systems. Context-less annotation does not provide the relationship between the metadata and the Web content (the context which the metadata content is describing). Thus, it is difficult for a third party who was not involved in the original annotation to confirm whether the annotated metadata is consistent with the Web content. Without such verification, the evidence, that is, the selected parts of the Web content used to annotate the metadata, cannot be corroborated with the annotation. This compromises the annotation and renders it unreliable. 
Context-aware annotation, however, establishes the relationship between the metadata, the content of the Web material, and the social context in which the content was produced. A context-aware annotation system can thus help librarians ensure the quality of the records more effectively by being able to:</p><p>• relate semantic content in the metadata to Web content;</p><p>• render agreement, disagreement, and different granularity of evidence;</p><p>• provide flexible and precise annotation of the evidence;</p><p>• relate ontology to metadata in relational metadata.</p></li><li><p>2. Post-custodian approach to Web archives cataloguing</p><p>The tagging movement allows actors other than the creator of the Web materials to structure meaning into the materials. This collaborative approach to organizing information has been shared by professional archivists in the development of the Records Continuum Theory (RCT) for organizing records and archives (Upward 1998). RCT challenges the custodial role of the archives. It advocates that, in a post-custodial paradigm, archivists must become more than mere physical caretakers and take on the role of identifying, controlling, and making electronic records continually accessible to society at large. As professionals in preserving information, archivists should take as much care in cataloguing their active holdings to facilitate access to public records as they do in preserving them. Similarly, in the context of a Web archive, the Web archivist should take on a more proactive role in transforming the Web archive into one that allows for greater and easier access to its materials. 
In the current Web environment, public users could also be encouraged to collaboratively help make sense of informal Web materials that are being preserved, as exemplified by the participants of the tagging movement.</p><p>In an attempt to illustrate how contextually organized materials can facilitate access to holdings in a Web archive, we shall use the example of the website of the Ministry of Manpower (MOM) in Singapore (</p><p>The MOM's mission is to achieve a globally competitive workforce and a great workplace for a cohesive society and a secure economic future for all Singaporeans. One of the ways it sets out to accomplish this aim is the setting up of an Occupational Safety and Health (OSH) Division that promotes OSH at the national level. It works with employers, employees, and all other stakeholders to identify, assess, and manage workplace safety and health risks so as to eliminate death, injury, and ill health. The department within the OSH Division focusing on the reduction of safety and health hazards is the OSH Inspectorate. It does so by providing advice and guidance through inspections of workplaces, investigating accidents, and enforcing the relevant laws. The hierarchical relationship between the various offices can be found on the interactive government online directory at</p><p>A snapshot of the relevant page is presented in figure 1.</p><p>In a typical work process, such as the communication of information to the public with regard to an industrial accident, both the division in charge of the policy area (OSH in this case) and the corporate communications department (CCD) would put up a joint draft, which goes through the PS to the Minister for approval depending on the nature of the subject to be announced. 
Such cross-divisional collaboration means that the filing of the drafting and approval process would be kept at both divisions, with OSH holding a series of case files relating to a particular subject or case (e.g. industrial accidents, public education on occupational safety issues, reports on occupational health, etc.). These case files involve all the drafts that took place for submission up to the divisional director, with CCD holding all drafts of press releases it receives from each divisional director and the subsequent changes after vetting by the bureaucratic and political masters.</p></li><li><p>However, because all Web communication comes under the purview of CCD, information based on the Web should be filed under CCD. To facilitate the different categories of CCD's work, the materials are divided into events, marketing, public education, publication, press releases, speeches, etc., and these are further subdivided by subject area, division, or departments which mirror the organization chart.</p><p>Figure 1. Organizational chart as reflected in the Singapore Government Directory interactive.</p></li><li><p>Following an archival arrangement of materials, the MOM fonds would contain all fonds of the various divisions and files of the different departments as presented in figure 2. 
In the case of an industrial accident, the department most intimately involved would be the Investigation Branch under the OSH Inspectorate Department, which comes under the Occupational Safety &amp; Health Division.</p><p>Here is a scenario of how a public policy scholar might examine how the Ministry of Manpower in Singapore handled an industrial accident, specifically the Nicoll Highway Collapse Incident. As this was an industrial accident, the OSH Inspectorate was the agency legislated to oversee investigations. To review the events from the government's point of view, the scholar can visit the OSH group of documents. He will be pointed to files containing the various public communication activities ( files include speeches by the minister (in parliament for the amendment of the Factories Act), commission reports, press releases, and even a Frequently Asked Questions (FAQ) section. However, these files may not all be available from the current website. This is because, as events unfold, the importance of information emanating from the government may change. This change can be seen by comparing the website then and now in figures 3a and 3b.</p><p>For example, the section on FAQ, one of the key documents available in 2004 to help the public understand and interpret the information on the site, was missing by 2006. All the helpful information is now no longer available at the live MOM website.</p><p>Figure 2. MOM's organization chart derived from the Singapore Government Directory interactive.</p></li><li><p>
The researcher will no longer be able to learn via the FAQ how the reports were being made and about the various degrees of commissions that the government appointed.</p><p>However, with the creation of a Web archive where such materials are organized into collections, and the arrangement of records made possible using annotation tools, changes in public communication patterns can be made more apparent. Not only will researchers benefit from being able to access evidence of changing trends, but so will ordinary citizens who want to find out about the accident at a later date. In addition, by relating the files to each other, one also discovers that not only was MOM involved, but the Ministry of National Development (MND) and the Building and Construction Authority (BCA) were also involved in offering joint reports on the event. Their insights helped to mould the new policies that came out of such reports and led to the creation of a new OSH Framework.</p><p>With these, we observe that context-aware Web annotation is not only important for current use but even more crucial for the lasting heritage and cultural value of Web materials. It is also important for organizing Web materials as records to be carried across time (Wu and Theng 2005, Wu and Heok 2006). Most of the current approaches to Web archives cataloguing surveyed in our last paper (Wu et al. 2006) have fallen short of the requirements to provide evidential and contextual organization to facilitate effective access.</p><p>Figure 3. (a) MOM circa 2004 from Web Archives, with FAQ. (b) MOM circa 2006 in the current website, without FAQ.</p></li><li><p>3. Web annotation system in service of Web archive cataloguing</p><p>As demonstrated in section 2, a context-aware Web annotation system can facilitate effective information discovery. 
In this section, we introduce the Web Annotation for Web Intelligence (WAWI) system. We will also demonstrate how four design principles are implemented to achieve the objectives of preserving the evidence and context in cataloguing and arranging Web archives. Such a system needs to be able to:</p><p>• relate semantic content in the metadata to the Web content;</p><p>• render agreement, disagreement, and different granularities of evidence;</p><p>• provide flexible and precise annotation of the evidence;</p><p>• relate ontology to metadata in relational metadata.</p><p>The WAWI annotation system is integrated with the Web archiving platform developed by the International Internet Preservation Consortium (IIPC) ( which comprises Web harvesting and access components (Heritrix URL:; NutchWax URL: http://archive-access.sourcef...</p></li></ul>
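<p>The first three capabilities listed in Section 3 can be made concrete with a small data model. The following is a minimal sketch in Python under stated assumptions, not WAWI's actual implementation: the class names <code>Evidence</code> and <code>Annotation</code>, their fields, and the use of character offsets to point at page text are all hypothetical and were chosen only for illustration. The fourth capability, relating ontology to metadata, is not covered.</p>

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Evidence:
    """A span of archived page text that supports a metadata value (hypothetical model)."""
    snapshot_id: str  # identifier of the archived snapshot; illustrative value only
    start: int        # character offset where the selected span begins
    end: int          # character offset just past the selected span
    quote: str        # the text the cataloguer selected as evidence


@dataclass
class Annotation:
    """One metadata statement with its evidence and reviewers' stances (hypothetical model)."""
    element: str                                       # metadata element, e.g. "creator"
    value: str                                         # catalogued value
    evidence: List[Evidence] = field(default_factory=list)
    agree: List[str] = field(default_factory=list)     # reviewers who accept the value
    disagree: List[str] = field(default_factory=list)  # reviewers who dispute it


def evidence_holds(ev: Evidence, page_text: str) -> bool:
    """Return True if the quoted span is found at the recorded offsets."""
    return page_text[ev.start:ev.end] == ev.quote


def unverifiable(annotation: Annotation, page_text: str) -> bool:
    """True for a context-less annotation (no evidence) or one whose evidence no longer matches."""
    if not annotation.evidence:
        return True
    return not all(evidence_holds(ev, page_text) for ev in annotation.evidence)
```

<p>Storing the selected quote together with its offsets is one possible design choice: it lets a third party who was not involved in the original annotation check the metadata against the archived text, which is the verification step that context-less annotation cannot offer.</p>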