  • 8/2/2019 Annotating Web archives - structure, provenance, and context through archival cataloguing


    Technical Note

    Annotating Web archives - structure, provenance, and context through archival cataloguing

    P. H. J. WU*, A. K. H. HEOK and I. P. TAMSIR

    Nanyang Technological University, 31 Nanyang Link, Singapore 637718

    Despite the success of Internet access via search technology, such ease of access is still not available in Web archives, as a greater amount of relevant contextual information is essential in accessing Web archives. The degree of relevance of the contextual information has to be customized to suit research on culture and heritage study over time. Information scientists have long been struggling to find a system that can help them organize Web archives so that users can have access to complete and coherent collections. Lessons can be learned from archivists, who have an established tradition of linking materials to their origin and ownership, or what is termed provenance. In this paper, we demonstrate how Web Annotation for Web Intelligence, more than just an intuitive way of expressing one's thoughts on the materials under study, is in fact an appropriate tool for cataloguing Web archives in order to ensure a high quality of access for users. Informed by Records Continuum Theory, we present a demonstration of access to archived Web materials. We then recommend an effective way of allowing the continual organization of Web archives based on several design principles for a Web annotation system. This system would preserve the evidence and context of the cataloguing process. Such a tool would also help facilitate collaboration among information professionals in organizing complex Web archives. Implementing the recommended Web annotation system will help ensure better-quality archives with more evidence and contextual information preserved within the system.

    1. Introduction

    Web users are accustomed to instant access to information, thanks to the success of Web search technology. However, because a Web archive keeps different versions of websites, a greater effort to catalogue the materials in an archive is needed to provide the easy access that Web users expect. There has been increasing interest in providing a more complex information architecture for leverage, such as taxonomy, metadata, ontology, and the integration of different modes of access, including searching, browsing, and routing. This paper examines a particular case of access to Web archives, which contain complex materials that can serve distinct communities, including social scientists and historians. We present a perspective in which websites are more than mere publications. They should be seen as evidence of

    *Corresponding author. Email: [email protected]

    New Review of Hypermedia and Multimedia, Vol. 13, No. 1, July 2007, 55-75


    the cultural activities of contemporary society. As such, their collection should be managed differently, as an archive would manage its holdings, preserving the contextual evidence of their content. In a previous paper (Wu et al. 2006), we demonstrated a bibliographic approach to cataloguing Web archives and showed how metadata produced by Web annotation can serve as points of access to Web archives. In that paper, a short survey of the various library Web archive models around the world also pointed to a pressing inadequacy in the available methods of organizing their materials. These usually employ bibliocentric cataloguing, which treats each website as an entity without any relationship to the other materials in the collection. This is because the contextual and provenancial information of these collections, which is essential for social scientists and historians to understand, is not made apparent, with much of the information being buried deep within the archives. A more suitable model being developed is the Arizona Model (Pearce-Moses and Kaczmarek 2005), where the archival principles of provenance and original order are adopted. This approach may prove more useful in presenting a Web archive's holdings to facilitate knowledge discovery. The technological challenge then becomes one of how Web annotation can be effectively extended to help organize contextual and provenancial relationships based on bibliographic metadata. We explain the need for these requirements with a concrete case in Section 2, from a post-custodian approach.

    In Section 3, a context-aware Web annotation system, termed the Web Annotation for Web Intelligence (or WAWI), is introduced. The WAWI Web annotation system ensures the capture of evidence and contextual information for the Web archive catalogue. WAWI is part of a joint project between the National Library Board of Singapore and Nanyang Technological University to catalogue and archive Singapore websites. Before explaining how context-aware annotation works, we will review the difference between context-less and context-aware systems. Context-less annotation does not provide the relationship between the metadata and the Web content (the context which the metadata content is describing). Thus, it is difficult for a third party who was not involved in the original annotation to confirm whether the annotated metadata is consistent with the Web content. Without such verification, the evidence, or the selected parts of the Web content used to annotate the metadata, cannot be corroborated with the annotation. This compromises the annotation and renders it unreliable. Context-aware annotation, however, establishes the relationship between the metadata, the content of the Web material, and the social context in which the content was produced. A context-aware annotation system can thus help librarians ensure the quality of the records more effectively by being able to:

    • relate semantic content in the metadata to Web content;

    • render agreement, disagreement, and different granularities of evidence;

    • provide flexible and precise annotation of the evidence;

    • relate ontology to metadata in relational metadata.


    2. Post-custodian approach to Web archives cataloguing

    The tagging movement allows actors other than the creator of the Web materials to structure meaning into the materials. This collaborative approach to organizing information has been shared by professional archivists in the development of the Records Continuum Theory (RCT) for organizing records and archives (Upward 1998). RCT challenges the custodial role of the archives. It advocates that, in a post-custodial paradigm, archivists must become more than mere physical caretakers and take on the role of identifying, controlling, and making electronic records continually accessible to society at large. As professionals in preserving information, archivists should take as much care in cataloguing their active holdings to facilitate access to public records as they do in preserving them. Similarly, in the context of a Web archive, the Web archivist should take on a more proactive role in transforming the Web archive into one that allows for greater and easier access to its materials. In the current Web environment, public users could also be encouraged to collaboratively help make sense of informal Web materials that are being preserved, as exemplified by the participants of the tagging movement.

    In an attempt to illustrate how contextually organized materials can facilitate access to holdings in a Web archive, we shall use the example of the website of the Ministry of Manpower (MOM) in Singapore (www.mom.gov.sg).

    The MOM's mission is to achieve a globally competitive workforce and a great workplace for a cohesive society and a secure economic future for all Singaporeans. One of the ways it sets out to accomplish this aim is the setting up of an Occupational Safety and Health (OSH) Division that promotes OSH at the national level. It works with employers, employees, and all other stakeholders to identify, assess, and manage workplace safety and health risks so as to eliminate death, injury, and ill health. The department within the OSH Division focusing on the reduction of safety and health hazards is the OSH Inspectorate. It does so by providing advice and guidance through inspections of workplaces, investigating accidents, and enforcing the relevant laws. The hierarchical relationship between the various offices can be found on the interactive government online directory at http://www.sgdi.gov.sg/. A snapshot of the relevant page is presented in figure 1.

    Figure 1. Organizational chart as reflected in the Singapore Government Directory interactive.

    In a typical work process, such as the communication of information to the public with regard to an industrial accident, both the division in charge of the policy area (OSH in this case) and the corporate communications department (CCD) would put up a joint draft, which goes through the PS to the Minister for approval depending on the nature of the subject to be announced. Such cross-divisional collaboration means that records of the drafting and approval process would be kept at both divisions, with OSH holding a series of case files relating to a particular subject or case (e.g. industrial accidents, public education on occupational safety issues, reports on occupational health, etc.). These case files include all the drafts prepared for submission up to the divisional director, while CCD holds all drafts of press releases it receives from each divisional director and the subsequent changes after vetting by the bureaucratic and political masters. However, because all Web communication comes under the purview of CCD, information based on the Web should be filed under CCD. To facilitate the different categories of CCD's work, the materials are divided into events, marketing, public education, publication, press releases, speeches, etc., and these are further subdivided by subject area, division, or department, mirroring the organization chart.

    Following an archival arrangement of materials, the MOM fonds would contain all fonds of the various divisions and files of the different departments, as presented in figure 2. In the case of an industrial accident, the department most intimately involved would be the Investigation Branch under the OSH Inspectorate Department, which comes under the Occupational Safety & Health Division.

    Here is a scenario of how a public policy scholar might examine how the Ministry of Manpower in Singapore handled an industrial accident, specifically the Nicoll Highway Collapse Incident. As this was an industrial accident, the OSH Inspectorate was the agency legislated to oversee investigations. To review the events from the government's point of view, the scholar can visit the OSH group of documents. He will be pointed to files containing the various public communication activities (http://www.mom.gov.sg/NewOSHFrameworkandInvestigationsonNicollHighwayCollapse). These files include speeches by the minister (in parliament, for the amendment of the Factories Act), commission reports, press releases, and even a Frequently Asked Questions (FAQ) section. However, these files may not all be available from the current website. This is because, as events unfold, the importance of information emanating from the government may change. This change can be seen by comparing the websites now and then in figures 3a and 3b.

    For example, the FAQ section, one of the key documents available in 2004 to help the public understand and interpret the information on the site, was missing by 2006. All of this helpful information is no longer available at the live MOM website. The researcher will no longer be able to learn via the FAQ how the reports were being made and about the various degrees of commissions that the government appointed.

    Figure 2. MOM's organization chart derived from the Singapore Government Directory interactive.

    However, with the creation of a Web archive where such materials are organized into collections, and the arrangement of records made possible using annotation tools, changes in public communication patterns can be made more apparent. Not only will researchers benefit from being able to access evidence of changing trends, but so will ordinary citizens who want to find out about the accident at a later date. In addition, by relating the files to each other, one also discovers that not only MOM was involved but that the Ministry of National Development (MND) and the Building and Construction Authority (BCA) were also involved in offering joint reports on the event. Their insights helped to mould the new policies that came out of such reports and led to the creation of a new OSH Framework.

    From this example, we observe that context-aware Web annotation is important not only for current use but even more so for the lasting heritage and cultural value of Web materials. It is also important for organizing Web materials as records to be carried across time (Wu and Theng 2005, Wu and Heok 2006). Most of the current approaches to Web archive cataloguing surveyed in our last paper (Wu et al. 2006) have fallen short of the requirements to provide evidential and contextual organization to facilitate effective access.

    Figure 3. (a) MOM circa 2004 from Web Archives, with FAQ. (b) MOM circa 2006 in the current website, without FAQ.


    3. Web annotation system in service of Web archive cataloguing

    As demonstrated in section 2, a context-aware Web annotation system can facilitate effective information discovery. In this section, we introduce the Web Annotation for Web Intelligence (WAWI) system. We will also demonstrate how four design principles are implemented to achieve the objectives of preserving the evidence and context in cataloguing and arranging Web archives. The system needs to be able to:

    • relate semantic content in the metadata to the Web content;

    • render agreement, disagreement, and different granularities of evidence;

    • provide flexible and precise annotation of the evidence;

    • relate ontology to metadata in relational metadata.

    The WAWI annotation system is integrated with the Web archiving platform developed by the International Internet Preservation Consortium (IIPC) (http://www.netpreserve.org/about/index.php), which comprises Web harvesting and access components: Heritrix (http://crawler.archive.org/), NutchWax (http://archive-access.sourceforge.net/projects/nutch/), and Wera (http://archive-access.sourceforge.net/projects/wera/). The system architecture resulting from the incorporation of annotation in the cataloguing process is shown in figure 4.

    Figure 4. WAWI annotation and cataloguing system integrated with the IIPC Web Archive platform.


    Details and a demonstration of the WAWI system are discussed in section 3.5. From sections 3.1 to 3.4, we shall focus on the design principles of the WAWI system and shall reference Annotea (Kahan et al. 2001) and CREAM (Handschuh et al. 2001) as model systems.

    3.1 Relating semantic content of the metadata to Web content

    As briefly mentioned in section 1, there are two different kinds of annotation systems: one (context-aware) provides the relationship between the semantic content of the metadata and the Web content, and the other (context-less) does not.

    Examples of context-less annotation systems developed in the Web archiving systems community can be found in Schneider et al. (2002) and Lampos et al. (2004). In Schneider et al. (2002), annotated metadata were used for browsing; in Lampos et al. (2004), they were meant to be implemented as an automatic tagging system.

    Context-aware annotation establishes the relationship between the metadata and the content of Web material. The Annotea project in the WWW Semantic Web Consortium is an example of a context-aware system (Kahan et al. 2001). It provided a relationship between the semantic and the document content through its two properties, annotates and context, in the namespace (defined at http://www.w3.org/2000/10/annotation-ns#). The WAWI annotation system adopted the Annotation Graph (AG) schema (Bird and Liberman 1999). The resulting XML document fragments for the annotations highlighted in figure 5 are presented below:

    <annoschema id="{GUID0}" datecreated="23 09 2005" createdby="ichsan" type="ontology" datemodified=""
        modifiedby=""
        url="http://app.sgdi.gov.sg/listing.asp?agency_subtype=dept&agency_id=0000000011">
      <Division Title="OrganizationHealthSafty" id="{GUID1}"
          begin="566" end="577" value="Organizational Health and Safety"
          meta="Organizational Safety and Health">
      </Division>
      <Division Title="ForeignManpwer" id="{GUID2}" begin="987" end="1004"
          value="Foreign Manpower Policy" meta="Foreign Manpower Policy">
      </Division>
    </annoschema>

    Each annotation schema contains several annotation attributes and elements. The id attribute contains the system-generated unique id for the schema; the url attribute denotes the Web page that is annotated in support of the schema; other self-explanatory attributes include datecreated, datemodified, modifiedby, and createdby.


    Each annotation element, such as Division, contains a begin and an end attribute, whose values are the page coordinates (see the discussion in Section 3.3) of the text portion of the DOM tree of the Web page. The value attribute contains the text of the Web page that is delimited by the begin and end page coordinates, which was highlighted as evidence (or context, in Annotea's terms). The meta attribute contains the metadata assigned to the element and supported by the evidence. In the MOM example discussed earlier, we created the annotation schema, an ontology, that relates to the MOM organization chart found on the SGDi website.
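    As an illustration of how the evidence (value) and metadata (meta) of each element might be read back from such a document, the following is a minimal sketch rather than WAWI code. The use of Python, the function name read_annotations, and the simplified, well-formed sample document with a shortened url value are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified schema document; the url value is shortened
# and the attribute set is reduced for illustration.
SCHEMA_XML = """
<annoschema id="{GUID0}" type="ontology" createdby="ichsan"
            url="http://www.sgdi.gov.sg/">
  <Division id="{GUID1}" begin="566" end="577"
            value="Organizational Health and Safety"
            meta="Organizational Safety and Health"/>
  <Division id="{GUID2}" begin="987" end="1004"
            value="Foreign Manpower Policy"
            meta="Foreign Manpower Policy"/>
</annoschema>
""".strip()


def read_annotations(xml_text):
    """Return the annotated page url and one record per annotation element."""
    root = ET.fromstring(xml_text)
    records = []
    for element in root:
        records.append({
            "element": element.tag,
            "begin": int(element.get("begin")),   # start page coordinate
            "end": int(element.get("end")),       # end page coordinate
            "evidence": element.get("value"),     # highlighted page text
            "metadata": element.get("meta"),      # value assigned by the cataloguer
        })
    return root.get("url"), records


url, records = read_annotations(SCHEMA_XML)
for record in records:
    print(record["element"], "|", record["evidence"], "->", record["metadata"])
```

    Keeping the evidence and the metadata side by side in each record is what allows a third party to compare the two later.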

    3.2 Rendering agreement, disagreement, and different granularity of evidence

    In Annotea, annotations are simply rendered as pencil symbols (Kahan et al. 2001). The pencil symbol model is limited, as it can only indicate the starting point, not the extent, of the annotation. On the other hand, the AG model of annotation in WAWI encompasses the whole extent of the annotation. When disagreement and different granularities of evidence occur, various overlapping patterns of the extent will result. This is where the need for rendering complex patterns of annotation comes in.

    Figure 5. Annotation schema, an ontology reflecting the MOM organization chart, and its supporting Web page at the Singapore Government Directory interactive (SGDi) (only partially shown).


    As demonstrated in figure 6, two disagreeing metadata records are shown by the overlapping annotations (evidence) of the OSH vision. With the highlighted patterns, the metadata records can then be verified and consolidated into a unified and agreed metadata record, as discussed in Section 1.
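    The idea of detecting overlapping extents with disagreeing metadata can be sketched as follows. This is only an illustrative sketch, not the WAWI rendering logic: the Annotation class, its field names, the sample coordinates, and the treatment of begin and end as half-open character offsets are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    cataloguer: str  # hypothetical field: who made the annotation
    begin: int       # start page coordinate
    end: int         # end page coordinate (assumed exclusive)
    meta: str        # metadata value assigned to the evidence


def overlap(a, b):
    """Return the shared (begin, end) extent of two annotations, or None."""
    begin = max(a.begin, b.begin)
    end = min(a.end, b.end)
    return (begin, end) if begin < end else None


def disagreements(annotations):
    """List pairs whose evidence overlaps while their metadata differ."""
    found = []
    for index, first in enumerate(annotations):
        for second in annotations[index + 1:]:
            shared = overlap(first, second)
            if shared is not None and first.meta != second.meta:
                found.append((first, second, shared))
    return found


# Hypothetical sample data: two cataloguers highlight overlapping text.
a = Annotation("cataloguer_a", 100, 160, "OSH vision")
b = Annotation("cataloguer_b", 140, 220, "OSH mission")
c = Annotation("cataloguer_c", 300, 320, "OSH vision")

for first, second, shared in disagreements([a, b, c]):
    print(first.cataloguer, "vs", second.cataloguer, "overlap at", shared)
```

    The shared extent is the portion a renderer could highlight differently so that a verifier sees where the two records draw on the same evidence.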

    3.3 Providing flexible and precise annotation of the evidence

    Annotea uses XPointer to define how an annotation is related to the document. The location of the annotated text in the document is represented by XPath, which uses the page element structure to point to a specific part of the document. However, XPointer can only point to the text at the element boundary; it does not point to a specific text position. In Annotea, the annotation does not include its own extent and cannot point to a part that crosses element boundaries. In our WAWI annotation system, the page coordinate approach was developed to provide these features. The page coordinate approach works by serializing the document as a sequence of text, omitting the document element structure. With this sequence of text as a coordinate system, the precise position and extent of an annotation are recorded as the start and end positions of the text in the document.
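    A minimal sketch of the page coordinate idea is given below, assuming that coordinates are character offsets into the serialized text and that end offsets are exclusive; neither detail is stated in the text. The TextSerializer class, the serialize and locate helpers, and the sample page are hypothetical, and the sketch ignores matters such as script content and whitespace normalization.

```python
from html.parser import HTMLParser


class TextSerializer(HTMLParser):
    """Collect only the text nodes of a page, dropping the element structure."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def serialize(html):
    """Return the page as one sequence of text."""
    parser = TextSerializer()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def locate(html, phrase):
    """Return (begin, end) offsets of the first occurrence of phrase, or None."""
    text = serialize(html)
    begin = text.find(phrase)
    if begin == -1:
        return None
    return begin, begin + len(phrase)


# Hypothetical page: the highlighted phrase crosses the <b> element boundary.
PAGE = "<p>Occupational <b>Safety</b> and Health Division</p>"

coordinates = locate(PAGE, "Safety and Health")
print(coordinates)
print(serialize(PAGE)[coordinates[0]:coordinates[1]])
```

    Because the offsets refer to the serialized text rather than to elements, the extent can start inside one element and end inside another.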

    Figure 6. Multiple-evidence overlapping annotations in WAWI.


    3.4 Relating ontology to metadata in relational metadata

    As shown in figure 7, the OSH archived Web pages circa 2004 have three metadata records corresponding to the Speech, Press Releases, and FAQ files of OSH. The FAQ metadata record for the FAQ files of the Web page in figure 8 is shown below; the url and datecreated attributes indicate that it was archived in 2004:

    <annoschema id="{guid}" type="metadata"
        datecreated="23 09 2004" datemodified="23 09 2004"
        createdby="ichsan" modifiedby="ichsan"
        url="http://web.archives/2004/www.mom.gov.sg/OSHD/">
      <ref>
        <nodeid>{GUID17}</nodeid>
        <nodename>FAQ File</nodename>
      </ref>
      <annoElements>
        <Title id="1" begin="34" end="63" value="Nicoll Highway Investigations"
            meta="Industrial Accident"></Title>
        <Subject id="4" begin="752" end="777"
            value="Frequently Asked Questions" meta="Frequently Asked Questions">
        </Subject>
      </annoElements>
    </annoschema>

    Figure 7. Speech, press release, and FAQ files in the Web archives of Occupation Safety and Health (OSH) Division of MOM circa 2004.

    Note that the additional <ref> element, like the CREAM <ref> attribute (Kahan et al. 2001), provides a pointer to the ontology node FAQ File with {GUID17}. This supplies the additional relational metadata that link the metadata to the ontology. As shown in figure 6, each node of the ontology (displayed in the left-hand frame) has its corresponding metadata, which are displayed in the right-hand frame. The referring path to the FAQ File node above is then: MOM → OccupationalHealthSafety → OSHInspectorate → IndustrialAccidents → NicollHighwayCollapse → FAQ.
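    How a nodeid reference might be resolved to its referring path can be sketched as follows. This is an assumption-laden illustration: the nested ontology document, every id other than {GUID17}, and the function name referring_path are hypothetical, and WAWI's actual storage of the ontology may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical ontology tree; only {GUID17} comes from the example record.
ONTOLOGY = ET.fromstring("""
<MOM id="{GUID10}">
  <OccupationalHealthSafety id="{GUID11}">
    <OSHInspectorate id="{GUID12}">
      <IndustrialAccidents id="{GUID13}">
        <NicollHighwayCollapse id="{GUID14}">
          <FAQ id="{GUID17}"/>
        </NicollHighwayCollapse>
      </IndustrialAccidents>
    </OSHInspectorate>
  </OccupationalHealthSafety>
</MOM>
""".strip())

# Reduced metadata record carrying the pointer to the ontology node.
METADATA = ET.fromstring(
    '<annoschema type="metadata">'
    "<ref><nodeid>{GUID17}</nodeid><nodename>FAQ File</nodename></ref>"
    "</annoschema>"
)


def referring_path(node, node_id, trail=()):
    """Depth-first search returning the list of node names down to node_id."""
    trail = trail + (node.tag,)
    if node.get("id") == node_id:
        return list(trail)
    for child in node:
        found = referring_path(child, node_id, trail)
        if found is not None:
            return found
    return None


node_id = METADATA.findtext("ref/nodeid")
path = referring_path(ONTOLOGY, node_id)
print(" -> ".join(path))
```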

    Figure 8. FAQ node in the MOM ontology and its linking metadata record.


    The View Page button allows the user to see the related Web page with the metadata and the evidence shown in figure 8. The ontology remains the same for the current Web archives in 2005 (figure 3b). As discussed in section 2, despite the fact that there is no FAQ in the current website, a user accessing it is still able to depend on its corresponding ontology to access the archived FAQ Web materials from 2004. This access allows the user to research the various cultural and heritage concerns, including how Singapore's MOM conducted its public education programme on the public hearing of the OSH Inspectorate's committee reports.

    3.5 WAWI Cataloguing and Annotation System for Web Archives

    An overview of the WAWI system is given here, followed by a demonstration in the following sections. The WAWI annotation and cataloguing system works with annotation schemas defined by the schema creator. A librarian in a library or the moderator of a user community can create schemas for both ontology and metadata. An annotation schema is represented as an XML document. It is platform-independent and supports a hierarchical (multi-level) structure, whereby one node can be drilled down to several sub-nodes for more detailed annotation. Using Dublin Core elements as an example, Date can be drilled down to CreatedDate and IssuedDate; Coverage can be drilled down to Temporal and Spatial. The annotation schema then serves as a template for the cataloguing of Web pages and the annotation of the evidence. Section 3.5.1 further explains the annotation schema.
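    The drill-down structure described above can be sketched by generating an empty schema template from a nested description. The function build_schema, the schema name DublinCoreExample, and the sequential id numbering are assumptions for illustration; the text does not say how WAWI assigns element ids.

```python
import itertools
import xml.etree.ElementTree as ET

# Nested description of the drill-down taken from the Dublin Core example.
DUBLIN_CORE_SUBSET = {
    "Date": {"CreatedDate": {}, "IssuedDate": {}},
    "Coverage": {"Temporal": {}, "Spatial": {}},
}


def build_schema(name, tree):
    """Build an empty annotation schema template from a nested dict."""
    root = ET.Element("annoschema", {"type": "metadata", "name": name})
    ids = itertools.count(1)  # assumed numbering scheme

    def add(parent, label, children):
        node = ET.SubElement(parent, label, {
            "id": str(next(ids)),
            "begin": "", "end": "", "value": "", "meta": "",
        })
        for child_label, grandchildren in children.items():
            add(node, child_label, grandchildren)

    for label, children in tree.items():
        add(root, label, children)
    return root


schema = build_schema("DublinCoreExample", DUBLIN_CORE_SUBSET)
print(ET.tostring(schema, encoding="unicode"))
```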

    For public records or materials that require specialized knowledge to organize, professional cataloguers may be enlisted. Otherwise, for materials that do not require specific skills or knowledge, users in a community can catalogue the websites and provide the evidence for cataloguing using the WAWI system. Librarians, managers, or moderators who are required to ensure the consistency and quality of the catalogue will then confirm this evidence against the catalogue records. Section 3.5.2 further explains the process and ways to administer and use the WAWI cataloguing and annotation system. The result of the annotation is captured by filling specific values in the template based on the annotation schema.

    3.5.1 WAWI annotation schema. Conceptually, the WAWI annotation schema consists of the following major components:

    • annotation title;

    • annotated text (or the evidence);

    • user input and comment (or the metadata value);

    • permission or access rights.

    Other information specified in the annotation schema includes id, name, datecreated, datemodified, createdby, modifiedby, editable, and url. All of this information is found in the attributes of the annotation schema. The id attribute is a unique, system-generated GUID.


    Each element contains begin, end, value, and meta attributes. The begin and end attributes denote the page coordinates of the annotation in a Web page. The annotated text is stored in the value attribute, and the metadata input by users are stored in the meta attribute. The text in the value attribute then serves as evidence for the metadata in the meta attribute. The following XML document is a sample annotation schema that contains Product, Company, and Price information and further details under them:

    <?xml version="1.0" encoding="utf-8"?>
    <annoschema id="{guid}" type="metadata" name="CatalogueTask"
        datecreated="23 09 2005" datemodified="23 09 2005"
        createdby="ichsan" modifiedby="ichsan" editable="yes"
        url="">
      <product id="1" begin="" end="" value="" meta="">
        <category id="2" begin="" end="" value="" meta=""></category>
        <model id="3" begin="" end="" value="" meta=""></model>
        <name id="4" begin="" end="" value="" meta=""></name>
      </product>
      <company id="5" begin="" end="" value="" meta="">
        <name id="6" begin="" end="" value="" meta=""></name>
        <address id="7" begin="" end="" value="" meta="">
          <building id="" begin="" end="" value="" meta=""></building>
          <postalcode id="10" begin="" end="" value="" meta="">
          </postalcode></address>
      </company>
      <price id="11" begin="" end="" value="" meta=""/>
    </annoschema>
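    To illustrate how an annotation result might be captured by filling values into such a template, the following sketch fills two elements of a reduced template. It is not the WAWI implementation: the fill function, the sample page text, and the rule that the metadata defaults to the evidence unless the cataloguer supplies another value are assumptions, and offsets are taken as character positions with an exclusive end.

```python
import xml.etree.ElementTree as ET

# Reduced, hypothetical version of the sample schema template.
TEMPLATE = """<annoschema type="metadata" name="CatalogueTask" url="">
  <product id="1" begin="" end="" value="" meta="">
    <name id="4" begin="" end="" value="" meta=""/>
  </product>
  <price id="11" begin="" end="" value="" meta=""/>
</annoschema>"""

# Hypothetical serialized text of the page being catalogued.
PAGE_TEXT = "Safety helmet, model SH-200, price S$25.00"


def fill(root, element_id, page_text, phrase, meta=None):
    """Record phrase as evidence for the element with the given id."""
    begin = page_text.find(phrase)
    if begin == -1:
        raise ValueError("phrase not found in page text")
    end = begin + len(phrase)
    for node in root.iter():
        if node.get("id") == element_id:
            node.set("begin", str(begin))
            node.set("end", str(end))
            node.set("value", phrase)
            # Metadata defaults to the evidence unless the user overrides it.
            node.set("meta", phrase if meta is None else meta)
            return node
    raise KeyError(element_id)


root = ET.fromstring(TEMPLATE)
name_node = fill(root, "4", PAGE_TEXT, "Safety helmet")
price_node = fill(root, "11", PAGE_TEXT, "S$25.00", meta="25.00 SGD")
print(ET.tostring(root, encoding="unicode"))
```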

    3.5.2 Cataloguing and annotation process based on the WAWI system. Overall, the flow of the system is divided into three stages. The first stage is the schema preparation stage. An annotation schema is created and saved in an XML database. The annotation schema can be modified. The second stage is the annotation process. At this stage, the annotation schema is loaded in the browser, together with the target Web page to be catalogued and annotated, which is retrieved from the archive's repository. By clicking and dragging, the targeted portion of the text under consideration is highlighted and captured in the annotation schema template. After users have finished annotating, the annotation is saved on the server side for subsequent retrieval and verification of catalogue records.

    The third stage is to search and retrieve the metadata and evidence from the previous cataloguing and annotation process and to confirm the metadata against the evidence in the catalogue records.
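    One mechanical part of this third stage, checking that the stored evidence still matches the text at the recorded page coordinates, could be sketched as below. Judging whether the metadata is actually supported by the evidence remains a human task. The verify function, the record layout, and the sample data are hypothetical.

```python
def verify(records, page_text):
    """Split records into those whose evidence matches the page and the rest."""
    confirmed, suspect = [], []
    for record in records:
        actual = page_text[record["begin"]:record["end"]]
        if actual == record["value"]:
            confirmed.append(record)
        else:
            suspect.append(record)
    return confirmed, suspect


# Hypothetical serialized page text and retrieved catalogue records.
TITLE = "Nicoll Highway Investigations"
PAGE_TEXT = TITLE + ": Frequently Asked Questions"

RECORDS = [
    {"begin": 0, "end": len(TITLE), "value": TITLE,
     "meta": "Industrial Accident"},
    # This extent is deliberately too short, so it should be flagged.
    {"begin": 31, "end": 50, "value": "Frequently Asked Questions",
     "meta": "Frequently Asked Questions"},
]

confirmed, suspect = verify(RECORDS, PAGE_TEXT)
print(len(confirmed), "confirmed;", len(suspect), "to be reviewed")
```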


    For a better understanding of the system, the context and use-case diagrams of the WAWI cataloguing and annotation process are given in figures 9 and 10. The context diagram shows that there are three types of actors interacting with the annotation system. The Librarian is the actor who creates, edits, and deletes an annotation schema. The Librarian can also catalogue and annotate the Web archive materials.

    The Cataloguer is the actor who annotates the Web pages based on the annotation schema created by the Librarian. Cataloguers can also retrieve, view, and modify the annotations. The last actor is the Manager. Managers can view the annotations made by cataloguers and generate reports from the annotation data. The detailed processes and interactions of each actor in the annotation system are shown in the use-case diagram (figure 10).

    These annotation data can be used by the Manager actor to produce management reports. We shall discuss the details of the reporting system and reports in a separate paper.

    3.5.3 System demonstration. Based on the description in section 3.5.2, the system demonstration is divided into three parts: (1) Schema Preparation, (2) Annotation/Cataloguing Process, and (3) Retrieval and Verification. The system is implemented with a Web-based client/server architecture. On the client side, it requires only a Web browser with JavaScript enabled. The server side requires a Web server (Apache), together with programming of the database server (Berkeley XML Database) and the servlet container (Tomcat).

    3.5.3.1 Schema preparation

    Librarian actors use the Annotation Schema Manager to create annotation schemas. As shown in figure 11, the annotation schema is represented in a tree view. Librarians can click the Save Schema button to indicate that they have finished creating or modifying the schema. The system then converts this tree view to an XML document and stores it in the database.

    Figure 9. Context diagram of the WAWI annotation and cataloguing system. [Diagram labels: Web Annotation System; actors Librarian, Cataloguer, and Managers; flows Create/Edit/Delete Schema, Annotate based on the schema, (Annotation Result), Reports, and Annotated Page.]

    3.5.3.2 Annotation/cataloguing process

    Figure 10. Use-case diagram of the WAWI annotation and cataloguing system.

    The cataloguer actor uses the Annotation Panel to annotate Web pages. As shown in figure 12, the panel has two frames. The left-hand frame displays the archived Web page and its associated annotation, while the right-hand frame displays the annotation schema. When the user selects a schema from the dropdown list in the right-hand frame, the schema stored in the XML database is retrieved and sent to the client as an XML document. The XML document is converted to a DOM object and rendered in a tree view. Next to each tree node is a textbox in which users enter the metadata. The annotation evidence is automatically extracted and copied to the textbox, and users are free to change the value in the textbox. The Web page displayed can be further annotated using the devices available in the right-hand frame.

    Lastly, during the verification stage, the left-hand frame also displays the overlapping effects of evidence. As shown in figure 13, two cataloguer actors have entered two opposing annotations in the right-hand frame. The left-hand frame then renders overlapping and non-overlapping highlighted text, indicating how the disagreement may have arisen from the evidence applied.
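    The overlapping and non-overlapping highlights can be illustrated with a minimal sketch. It assumes each cataloguer's evidence is a half-open character range (start, end) within the page text; the function name `split_evidence`, the sample sentence, and the offsets are hypothetical, and the paper does not specify how WAWI computes the highlights.

```python
def split_evidence(span_a, span_b):
    """Split two half-open character ranges into shared and exclusive parts."""
    start = max(span_a[0], span_b[0])
    end = min(span_a[1], span_b[1])
    overlap = (start, end) if start < end else None

    def outside_overlap(span):
        # Return the parts of `span` that the other cataloguer did not select.
        if overlap is None:
            return [span]
        parts = []
        if span[0] < overlap[0]:
            parts.append((span[0], overlap[0]))
        if overlap[1] < span[1]:
            parts.append((overlap[1], span[1]))
        return parts

    return {
        "overlap": overlap,
        "only_a": outside_overlap(span_a),
        "only_b": outside_overlap(span_b),
    }


# Hypothetical page text and evidence selections by two cataloguers.
text = "The ministry released its report on the accident in March."
evidence_a = (4, 32)   # "ministry released its report"
evidence_b = (13, 48)  # "released its report on the accident"

result = split_evidence(evidence_a, evidence_b)
shared_text = text[result["overlap"][0]:result["overlap"][1]]
```

    Rendering the three groups in different highlight colours would let a reviewer see which part of the evidence both cataloguers relied on and which part only one of them used.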

    3.5.3.3 Relate the metadata to the ontology

    Metadata and ontology are related through the ref element in the metadata annotation schema, as demonstrated in figure 14. In the user interface, there is a ref node, which consists of nodeid and nodename nodes, through which the user can relate the metadata to a specific node of the ontology. Clicking on the "..." button next to the nodeid node brings up the ontology window, where the user selects the ontology node to which the metadata will relate.
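    As an illustration only, the effect of selecting an ontology node can be sketched as filling in the nodeid and nodename children of the ref element. The `metadata` root element, the function name `link_to_ontology`, and the sample identifier and label are assumptions; only the ref, nodeid, and nodename names come from the description above.

```python
import xml.etree.ElementTree as ET


def link_to_ontology(metadata, node_id, node_name):
    """Record the chosen ontology node in the metadata record's ref element."""
    ref = metadata.find("ref")
    if ref is None:
        ref = ET.SubElement(metadata, "ref")
    for tag, value in (("nodeid", node_id), ("nodename", node_name)):
        child = ref.find(tag)
        if child is None:
            child = ET.SubElement(ref, tag)
        child.text = value
    return metadata


# Hypothetical metadata record with an empty ref element.
record = ET.fromstring(
    "<metadata><title>Example page</title>"
    "<ref><nodeid/><nodename/></ref></metadata>"
)

# Hypothetical ontology node picked in the ontology window.
link_to_ontology(record, "n42", "Accident response")
print(ET.tostring(record, encoding="unicode"))
```

    Storing both the identifier and the label keeps the record readable on its own while still pointing unambiguously at one ontology node.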

    Figure 11. Annotation schema manager.


    3.5.3.4 Metadata and evidence search

    All the metadata and evidence captured during the annotation process can be searched. The search engine displays all the fields of the search results that correspond to the search parameters. The search is translated into an XQuery query to the XML database.

    As shown in figure 15, the search panel is divided into two frames. The left frame displays all the available search fields, each with a textbox in which the user enters search keywords for that field. When the user clicks on the Search button, the system performs the search and displays the results in the right frame.

    In the result frame, each entry has a link in the Title column. This link takes the user to the archived Web page and its associated annotation, as shown in figure 13.
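    The translation of search fields into an XQuery query can be sketched as follows. This is a minimal illustration under stated assumptions, not the WAWI implementation: the function name `build_xquery`, the container name `annotations.dbxml`, the `metadata` root element, and the use of substring matching with `contains` are all hypothetical.

```python
import re


def build_xquery(criteria, collection="annotations.dbxml"):
    """Build an XQuery FLWOR expression from {field: keyword} search criteria."""
    conditions = []
    for field, keyword in criteria.items():
        if not keyword:
            continue  # skip fields the user left empty
        if not re.fullmatch(r"[A-Za-z_][\w.-]*", field):
            raise ValueError(f"invalid field name: {field!r}")
        # In XQuery string literals, '&' starts an entity reference and a
        # double quote is escaped by doubling it.
        literal = keyword.replace("&", "&amp;").replace('"', '""')
        conditions.append(f'contains($r/{field}, "{literal}")')
    where = f" where {' and '.join(conditions)}" if conditions else ""
    return f'for $r in collection("{collection}")/metadata{where} return $r'


# Hypothetical search: one field filled in, one left blank.
query = build_xquery({"title": "flood", "creator": ""})
print(query)
```

    Combining the per-field conditions with `and` mirrors a form in which every filled-in field narrows the result set; a browse function could reuse the same builder with no conditions at all.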

    The search function in the search panel can easily be extended to perform a browse function. An integrated search on Web archive materials and metadata may be useful at times. To achieve this, the free-text and URL search available in WERA via NutchWax can be integrated into the WAWI metadata search engine.

    Figure 12. Annotation process and result on a Web page.

    Figure 13. Overlap annotation reflecting disagreeing evidence in collaborative cataloguing.

    Figure 14. Relating the metadata with ontology.

    4. Conclusion

    Cataloguing is a timeless and fundamental practice for organizing information regardless of the types of materials. However, the growth of the Internet continues to outpace attempts to describe it. With the help of Internet technologies and the proposed WAWI system, it is hoped that more collaborative efforts among information professionals and even the public can be effectively mobilized to help catalogue the Web. One of the most intuitive methods to transform the Web into one that allows greater interaction between systems is Web annotation. This paper proposes a context-aware Web annotation system which can provide evidence for, and preserve the context of, the catalogued records of the materials within a Web archive. It enumerates how such a system can help archivists ensure the quality of the records by being able to:

    • relate semantic content in the metadata to Web contents;

    • render agreement, disagreement, and different granularities of evidence;

    • provide flexible yet precise annotation of the evidence;

    • relate ontology to metadata in the form of relational metadata.

    Figure 15. Search and search-result user interface.


    Such a system is also congruent with the tagging movement exemplified by Technorati, Flickr, and del.icio.us, which itself reflects a growing trend of leveraging collective efforts to organize materials on the Internet. A context-aware annotation system helps assure the quality of the materials organized in a Web archive: the reasoning behind each decision to annotate Web materials is made visually obvious, and an inconsistency resolution mechanism like those found in Wikipedia can be invoked to resolve discrepancies immediately or reserve them for future resolution.

    A review of existing Web archive cataloguing and access practices was carried out to assess whether the WAWI Web annotation system was comparable in providing state-of-the-art ways of organizing Web archive materials. By linking Web-archived and current materials via an ontology, we also concretely demonstrated how better-quality access can be achieved to facilitate a historical understanding of a government's handling of accidents on a national scale. With evidence and context annotation in the cataloguing process, the collaborative efforts of a community of users and archivists to maintain the catalogue are facilitated. This opens up new horizons for creating a Web archive that is at once more research-oriented and more flexible in its approach, and that copes with the changing needs of users. All of this is achieved while the archive remains robust enough to present its holdings meaningfully through time.

