context-based aggregation of archival data: the role of authority records in the semantic landscape

ORIGINAL PAPER

Context-based aggregation of archival data: the role of authority records in the semantic landscape

Ricardo Eito-Brun

© Springer Science+Business Media Dordrecht 2014

Abstract The formal release of Encoded Archival Context for Corporate Bodies,

Persons, and Families (EAC-CPF) in 2010 added the need to deal with an additional

standard to encode context and authority records. But the possibilities of EAC-CPF

go beyond the control of authority records and access points, and this standard

constitutes a relevant milestone in the definition of a complex archival information

system made up of interconnected, cross-linked records. Based on eXtensible

Markup Language, EAC-CPF makes possible the design of semantically rich

browsing experiences that give access to distributed description of records and to

the detailed data of the persons, corporate bodies, or families that created them. This

paper presents a collaboration framework for archival information systems that

exploits the relationships built between finding aids and shared context and

authority records encoded in EAC-CPF. The proposed architecture is built on top of

a set of software components that interact using open information retrieval, content

aggregation, and semantic data standards. On top of this architecture, different user-

oriented solutions can be built to browse and explore the contents of the aggregated

collections. One of these applications is a navigational aid or topic map that serves

as a semantically enriched access layer and ensures the location of the records held

by different archives. The proposed architecture can be applied to solve different

information access challenges that require a single point of access to distributed

data. It can be deployed or mapped to existing technical architectures to improve the

interaction of users with a set of networked repositories.

Keywords XTM · RDF · Metadata aggregation · Engineering archives · Descriptive metadata standards · EAC-CPF

R. Eito-Brun (✉)

Universidad Carlos III de Madrid, c/Madrid 124, 28030 Getafe, Madrid, Spain

e-mail: [email protected]

Arch Sci

DOI 10.1007/s10502-014-9215-3

Introduction

Archives have made progress in the deployment of Web-based services that provide

access to finding aids and digital representations of the records. The adoption of

tools, software, metadata schemas, and related working procedures has established

the basis for scenarios where archives can leverage cooperative efforts, build

collaborative networks, and devise innovative information services for the end

users. The work of archivists cannot be understood in isolation, and the possibilities

of reaching descriptions and representations of records held by different archives

through the Web open a new space for researchers, no longer constrained by

geographical location of the records. The adoption of standards to encode, publish,

and share finding aids and context and authority records is one of the enablers of

new opportunities for collaboration. Encoded Archival Description (EAD) and

Encoded Archival Context for Corporate Bodies, Persons, and Families (EAC-CPF)

have been applied in different projects with the objective of sharing finding aids

created by different archives. But the possibilities offered by the combination of

these standards with those defined in the context of the Semantic Web require

further exploration. The progressive adoption of EAC-CPF since its publication in

2010 should not be seen just as a technical choice for encoding context and

authority records, but as an opportunity to devise semantically enriched browsing

interfaces based on the aggregation of descriptive metadata that share Web-

accessible access points.

To date, most data aggregation initiatives developed in archives either harvest finding aids from archives that publish them on the Web or receive finding aids from the archives and integrate them into a common catalog. Metadata formats based on eXtensible Markup Language (XML) and mature harvesting tools based on the OAI-PMH protocol make it possible to build, with affordable effort, collective databases that group together

descriptive records from different centers. These collective databases act as a single

point of access where end users can run their searches and be redirected to the

finding aids and to the digital representations of the records.
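As a rough illustration of this harvesting model, the following Python sketch builds an OAI-PMH ListRecords request and extracts identifiers and titles from a response. It is a minimal sketch and not part of the project described in this paper: the repository address archive-a.example.org, the sample response, and the function names are hypothetical. Only the verb, metadataPrefix, and set parameters and the oai_dc format are taken from the OAI-PMH protocol.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and unqualified Dublin Core.
OAI_NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a data provider."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Return (identifier, title) pairs found in a ListRecords response."""
    root = ET.fromstring(xml_text)
    results = []
    for record in root.iterfind(".//oai:record", OAI_NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=OAI_NS)
        title = record.findtext(".//dc:title", namespaces=OAI_NS)
        results.append((identifier, title))
    return results


# Hypothetical response; a real aggregator would fetch it over HTTP.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:archive-a.example.org:fa-001</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Personal fonds of a civil engineer</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

harvest_url = list_records_url("https://archive-a.example.org/oai")
records = parse_records(SAMPLE_RESPONSE)
```

In this sketch the aggregator keeps only an identifier and a title for each record; a real collective database would also store the link back to the finding aid at the contributing archive.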

The aggregation process is usually restricted to EAD finding aids. Being the most

relevant tool to access data about archives’ holdings, finding aids provide detailed

descriptions of the records, their provenance, organization, as well as historical and

biographical details about the persons, organizations, or families that created the

fonds. Thurman (2005, p. 185) defined finding aids as “a single document that places the materials in context by consolidating information about acquisition and processing; provenance, including administrative history or biographical note; scope of the collection, including size, subjects, media; organization and arrangement; and an inventory of the series and folders.” But today’s archivists need to manage

additional resources that require different types of metadata, and the descriptive

tasks are not only guided by ISAD(G) or EAD: other schemas—not initially

designed by archivists and/or for archivists—need to be used. This is the case of

structural metadata like METS or metadata for describing cultural heritage objects

like VRA Core. Following the idea of Yakel and Kim (2005, p. 1427), it seems more

appropriate to use the term “access tools” instead of “finding aids” to refer to any

kind of surrogate that describes and provides access to the items in an archive.

Although the need to combine different metadata schemas was identified by

different authors (Kim 2003; Cornish 2004; Clavaud and Sevigny 2005; Bak and

Pam 2008; Bountouri and Gergatsoulis 2009), the available technical solutions still have limitations in coping with multiple descriptive schemas. Support is limited even for EAC-CPF and for the Simple Knowledge Organization System (SKOS) format, which are designed to deal, respectively, with context and authority records and with controlled vocabularies.

The role of EAC-CPF in cooperation and aggregation efforts needs to be

evaluated to assess all its possibilities. Together with EAD, EAC-CPF is expected to be the

most relevant application of the XML language in archives. EAC-CPF establishes a

schema to create context and authority records compatible with the ISAAR(CPF)

ICA standard. It contains the elements and attributes needed to encode information

about the creators of the records and the circumstances that led to their creation.

EAC-CPF also incorporates elements to assign access points to be used for retrieval

purposes (Szary 2006). EAC-CPF records are created as separate files that can be

linked later to the EAD finding aids. Pitti (2004) identified the benefits of a standard

for describing creators, namely economic benefits, as authority records are costly to

create and maintain, and require a complex intellectual research process. Instead of

having different archives creating authority records for the same entities, authority

records could be made once and shared and completed by other centers. In addition,

EAC-CPF records have a high value as independent information resources, with

their value extending beyond archives, so they could be applied in other scenarios

for managing access to cultural heritage objects and bibliographical materials. Szary

(2006) indicated the need for recognizing the additional value of authority records,

as contextual information offers the basis for building a robust architecture for

archival information systems.

A collaboration framework for archives therefore must not only support archival

standards like EAD and EAC-CPF, but also different types of descriptive schemas

and, more importantly, try to leverage the advantages that context and authority

records and controlled vocabularies may offer to support the discovery of

information spread across multiple archives. This paper describes a technical

architecture for a distributed collaboration environment based on shared context-

related information. The proposed architecture leverages current aggregation

scenarios with the use of EAC-CPF and SKOS. Archives providing descriptive

records to the aggregation service may use as access points for their finding aids

values retrieved from Web-enabled repositories of authority records and thesauri.

Once harvested, the aggregation service can infer relationships between the entities

and subjects represented by the authority records from their co-occurrence in the

same finding aids. The inferred relationships are then used to build an access layer

on top of the aggregated set of descriptive records. End users can browse this access

layer to discover records and relationships that were not initially encoded in the

context and authority records, but explicitly stated in the whole set of aggregated,

indexed finding aids. A semantically enriched information access layer is created as

part of the aggregation process.
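The inference step can be illustrated with a small Python sketch, assuming that each harvested finding aid is reduced to its URL and the list of authority URIs used as its access points. All URLs and the function name infer_relationships are hypothetical, and the paper does not prescribe this data structure.

```python
from collections import defaultdict
from itertools import combinations


def infer_relationships(finding_aids):
    """Map each unordered pair of access points to the finding aids where both occur.

    The list of finding aids attached to a pair is the context of the
    inferred relationship.
    """
    relationships = defaultdict(list)
    for finding_aid_url, access_points in finding_aids.items():
        # Sorting gives every pair a stable orientation regardless of input order.
        for first, second in combinations(sorted(set(access_points)), 2):
            relationships[(first, second)].append(finding_aid_url)
    return dict(relationships)


# Hypothetical aggregated data: finding aid URL -> access point URIs.
finding_aids = {
    "https://archive-a.example.org/fa/1": [
        "https://authorities.example.org/eac/person-1",
        "https://authorities.example.org/eac/corp-7",
    ],
    "https://archive-b.example.org/fa/9": [
        "https://authorities.example.org/eac/person-1",
        "https://authorities.example.org/eac/corp-7",
        "https://vocab.example.org/skos/bridges",
    ],
}

relationships = infer_relationships(finding_aids)
```

Here the pair formed by person-1 and corp-7 is supported by finding aids held in two different archives, which is the kind of cross-archive link that no single authority record would contain.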

The term “context-based aggregation” is used to characterize the proposed

collaboration environment. This term is selected for two reasons: (a) the access layer

is built on access points taken from shared authority lists, and (b) the access layer to

explore the aggregated collection infers relationships between access points from the

context provided in the finding aids where the access points co-occur. The descriptive

metadata records and finding aids provide the context where the relationships

between persons, corporate bodies, families, or subjects happen. The aggregation

process not only harvests and collects metadata records in a single information space,

but infers and gives sense to well-defined, semantic relationships and links between

the entities and topics used as access points for the archival materials.

Research hypotheses

The design of the proposed collaboration framework and its related technical

architecture was motivated by the constraints of the current aggregation initiatives.

This paper is the result of research undertaken in the context of archives of historical

engineering records. The research process was executed with the purpose of

validating these hypotheses:

• In the global context of the Web, archivists lack technical solutions to manage

descriptive metadata and authority records based on different schemas: EAD,

MADS, EAC-CPF, MARC, etc.

• The adoption of standards to encode finding aids, authority records, and

controlled vocabularies is a key requirement to establish a global archival

information system like the one proposed by Pitti (2004). Unless the difficulties

of linking finding aids with authority records and controlled vocabularies are

removed with economically feasible technical solutions, the definition of this

kind of global information system will be extremely complex.

• Collaborative environments for archives need to be built on top of the capability

of reusing context and authority records. The availability of tools with this

capability will promote the adoption of the standard. Unless these tools are

widely available, the risk of EAC-CPF being overlooked by the professional

community is high and needs to be taken seriously.

• Global, large-scale archival information systems must provide end users with

search capabilities beyond full-text and field-qualified searches. New approaches

for browsing and discovering records based on cross-linking between finding

aids, authority records, and controlled vocabularies will significantly improve

the user experience.

• Open-source software solutions developed for archives provide partial support to

XML-based standards like EAD, EAC-CPF, RDF, and SKOS. In most cases,

they are just supported as an import and export mechanism. This partial support

limits the opportunities for archivists to take advantage of these standards.

• Archivists are still reluctant to consider the benefits of adopting specifications

like Resource Description Framework (RDF) or XTM (XML topic maps) as part

of their information systems.

The proposed architecture is designed in response to these hypotheses. The resulting

model and its technical implementation offer archivists and researchers a set of

software tools to create finding aids, link them to remote authority records, and to

build a semantic-based access layer on top of aggregated metadata and access points

based on EAC-CPF and SKOS records. This technical solution is the result of a

4-year collaboration project between CEHOPU (Centre for Historical Studies of

Public Works and Town Planning-CEDEX-Ministerio de Fomento, Spain) and the

research team. The initial scope of the project was the creation of separate Web sites

to make the finding aids for the personal fonds of Spanish civil engineers accessible.

Initial work focused on the tools to create and manage finding aids and reuse context

and authority records in an open environment. The analysis of the access methods

applied on these sites led to the design of an access layer where descriptive records

coming from different archives could be easily located. With this aim, a second

component was designed to collect descriptive metadata records and build an access

layer on top of the aggregated metadata; the relationships between the access points

were inferred and made explicit. This second component also generates Web-based

presentations to browse the relationships between access points and reach finding

aids and document representations.

Archival standards and data aggregation: an overview

Most of the approaches to create collective databases of finding aids described in the

professional literature rely on the use of the EAD standard. EAD was designed to

mitigate the restrictions on finding primary information sources due to the

geographical distribution of the collections (Cornish 2004). Sigler (2009) reviewed

how EAD impacted on the working methods of archivists: with EAD-encoded

finding aids, it would be possible to access detailed descriptions of records and

information resources available at archives located worldwide; fonds and collections related by provenance but geographically or administratively dispersed could be virtually integrated on the Internet. This capability led to a growing interest in tools to publish EAD descriptions. EAD also contributed to removing the barriers

between archivists and other documentation professionals, traditionally justified by

the uniqueness of the archival records.

A review of the academic and professional literature reports different significant

projects that made use of EAD to create collective databases of finding aids.

Emblematic, pioneering projects include OAC (Online Archive of California), A2A,

and the Archives Hub among others.

Online Archive of California (OAC) was a collective database of finding aids and digitized materials relevant to the history and culture of the United States (Gilliland-Swetland 1998). As EAD was gaining stability and becoming known in the United States, the University of California started the creation of a collective database from a set of existing finding aids (totaling 30,000 pages) and the materials

digitized in the California Heritage Digital Image Access project (28,000 pictures

about the history and culture of California). The project started in 1996 with the name

UC-EAD and was renamed OAC 2 years later. At the end of 1998, OAC was

officially incorporated into the California Digital Library (CDL) database.

Cornish (2004) described the Northwest Digital Archives (NWDA) project that

had the participation of fifteen US-based institutions and created a centralized

repository of 2,300 finding aids.

Clavaud and Sevigny (2005) discussed four projects developed in France by the

Centre Historique des Archives Nationales (CHAN) to provide archivists with

methods and tools to control the quality of the finding aids and extract Dublin Core

metadata for their integration in other repositories through the OAI-PMH technical

protocol. The tools included PLEADE and Navimages. The first one was the result

of a private initiative of the companies AJLSM and Anaphore and included a feature

for searching the full text of the EAD documents and indexed elements. It was used

by the CHAN, Denver Public Library, Archives Nationales de France, and Centre

des Archives d’Outre-Mer (Aix-en-Provence, France). Navimages—supported by

the Direction des Archives de France—was a toolkit to organize, process, and

publish collections of digitized images. It was used in the Archives Canada–France

Project. Clavaud and Sevigny remarked that the possibility of using controlled

vocabularies and authority lists to fill specific elements and attributes was

considered, although at this time the number of controlled vocabularies and

indexing practices shared at the national level in France was too low to justify the

investment.

Hill et al. (2005) described three collaborative projects completed in the United

Kingdom: Archives Hub, Access to Archives (A2A), and Navigational Aids for the

History of Science, Technology and the Environment (NAHSTE). The different

ways to apply EAD in these three cases demonstrate the flexibility of the standard to

create, store, index, search, and display finding aids. Archives Hub started in 1999

with the aim of creating a collective online inventory of records from UK colleges

and universities; it collected a total of 18,000 finding aids for fonds and collections.

Archivists created EAD documents using Web-based templates, and the finding aids

were sent to the Archives Hub team that verified the documents and uploaded them

into the central repository. Archives could maintain local copies of their finding aids

and publish them in their Web sites. A harvesting process and a meta-index called

Spokes were developed later to support a distributed repository with centralized

indexes on top.

Access to Archives (A2A) was an online database that offered federated access to

the finding aids of around four hundred UK information centers, including museums

and libraries. The objective of A2A was the retrospective conversion of existing

finding aids into EAD to create the basis for a network of national archives. A2A

was interested in capturing detailed, multilevel finding aids, and around one million

pages were collected. Data providers did not need to create the EAD files: finding

aids on paper or electronic format were sent to a coordination group that converted

them into EAD using the RLG Best Practices Guidelines. As part of the conversion

process, EAD headers were mapped to Dublin Core using encoding analogs and the

names of persons, families, corporate bodies, and places in the elements

<origination> and <controlaccess> were controlled using the national authority

lists and the UNESCO thesaurus.

The last project described by Hill and colleagues is NAHSTE, an initiative of the

Edinburgh University Library, Glasgow University Archive Services, and Heriot-

Watt University Archive to catalog records and manuscripts relevant to the history

of science and technology. The project combined EAD with ISAAR(CPF), due to

the lack of maturity of the existing version of EAC at that time. An XML editor was

used to create the contents; the links between EAD files and authority records were

recorded in a separate database and in the @authfilenumber attribute of the

<persname>, <corpname>, <geogname> and <subject> elements. LCSH was

used for subjects and places, and headings were built using the UK National Council

on Archives Rules for the Construction of Personal, Place, and Corporate Names

(NCA Rules).

Imhof (2008) described an interesting project from the German Bundesarchiv to

build a portal for the records of the SED (Socialist Unity Party of Germany) and the FDGB

union available in five archives of the former German Democratic Republic. Imhof

remarked that with EAD, it was possible to create more detailed descriptions than

with ISAD(G). He identified the need to improve the methods applied to search and

display results, as the retrieval system not only needs to show the retrieved items,

but also the relationship between the data in the finding aid and its context.

Huvila (2008) used the term “participatory archive” to refer to a model

characterized by decentralized management and a radical orientation toward end

users. The activities developed as part of the Archival Excellence in Information

Seeking Studies Network (AX-SNet) project were based on this philosophy, trying

to involve end users in the descriptions of records using Web 2.0 technologies. This

approach is aligned with the idea of involving users in the activities of the archive.

From a metadata perspective, they opted for the use of Dublin Core and CIDOC-

CRM to describe the relationships between information resources.

In all these cases, it is possible to find two constants: (a) the use of EAD to

encode and share finding aids, and (b) the difficulties of using controlled

vocabularies and authority records, mainly due to the lack of maturity of the

EAC-CPF standard at the time of developing these initiatives.

To conclude this section, it is essential to include a reference to the Social

Networks and Archival Context (SNAC) project (footnote 1).

It was launched in 2010 with the participation of the Library of Congress, OCLC

Research, Virginia Heritage, Getty Vocabulary Program, CDL, and Northwest

Digital Archive. Its purpose was the automatic extraction of EAC-CPF data from

the EAD files provided by the network of participants. The initiative goes beyond

the extraction of authority information, as extracted data are validated using

authority files like Library of Congress Name Authority File (LCNAF) and Getty

Vocabulary Program Union List of Artist Names (ULAN). At the time of writing

this paper, the SNAC has processed more than 20,000 finding aids and around

130,000 EAC-CPF files have been derived. A prototype is available online where

the generated authority files can be browsed, and different tools and utilities have been made available to the professional community, including a SPARQL endpoint.

1 http://socialarchive.iath.virginia.edu.

Architecture of the collaboration framework

The proposed collaboration framework is the result of the progressive development

of complementary components. The project started with the design of an EAD-

based tool for archives wanting to describe their collections and records and publish

finding aids on the Web. The architecture initially designed evolved to incorporate

aggregation capabilities and to provide end users with cross-browsing capabilities

based on semantically enriched links. As a result of this evolution, the framework is

made up of two independent components: (a) the component designed for the

creation and management of local descriptions and finding aids, or Local Work

Environment (LWE), and (b) the component designed to aggregate metadata

surrogates and create an access layer based on inferred relationships; it receives the

name of Integrated Metadata Register (IMR).

LWE is a tool for archivists. It is used to create XML-based finding aids and

metadata records based on any other XML schema. Its main component is an XML

editor extended with additional features. It provides the capability of searching remote repositories of context and authority records (EAC-CPF) and SKOS thesauri. Retrieved EAC-CPF and SKOS records can be incorporated into the finding aids as access points. The interaction between the XML editor and the remote EAC-CPF and SKOS repositories is implemented through a profile of the Search/Retrieve via URL (SRU) protocol, an evolution of Z39.50. The use of SRU to

search and retrieve authority records is one of the most interesting features of the

proposed architecture, as it gives the choice of interacting—from different XML

editors and content creation tools—with remote repositories of EAC-CPF or SKOS

data.
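A request of this kind might look as follows. This is a hedged Python sketch rather than the project's actual SRU profile: the endpoint authorities.example.org, the record schema name "eac-cpf", and the helper sru_search_url are assumptions made for illustration. The operation, version, query, recordSchema, and maximumRecords parameters and the cql.serverChoice index come from SRU 1.2 and CQL.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def sru_search_url(base_url, cql_query, record_schema, maximum_records=10):
    """Build an SRU 1.2 searchRetrieve URL for a remote authority repository."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": cql_query,                      # CQL query string
        "recordSchema": record_schema,           # schema in which records are returned
        "maximumRecords": str(maximum_records),  # upper bound on returned records
    }
    return base_url + "?" + urlencode(params)


# Hypothetical lookup issued by an XML editor while the archivist types a name.
url = sru_search_url(
    "https://authorities.example.org/sru",
    'cql.serverChoice = "Torroja"',
    record_schema="eac-cpf",
)

# Decoding the query string shows the parameters the server would receive.
sent = parse_qs(urlsplit(url).query)
```

Because the request is a plain URL, any editor or content creation tool able to issue an HTTP GET and parse the XML response could reuse the same repositories.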

With LWE, archivists can also generate Web publications in HTML and PDF

formats for their finding aids, and create indexes to browse the contents of the local

collections. The Web publications created with LWE are limited to the finding aids

held by the local archive: they do not allow browsing the full set of finding aids

created by the other centers involved in the collaboration environment.

The second component, IMR, works as a repository of aggregated metadata. IMR

compiles a subset of the metadata created by archivists working with their LWEs. It

does not store copies of the aggregated finding aids, but a subset of their metadata

and the links to reach the remote finding aids.

IMR also offers end users an access tool with semantically enriched browsing capabilities. IMR is in fact an XML repository based on the XTM schema (topic maps). It offers more than a traditional index of persons, corporate bodies or places pointing to related finding aids. The IMR access layer displays relationships between persons (engineers and architects in the current implementation), companies, public institutions, places, subjects, and engineering works.

These relationships are inferred from the access points assigned to the aggregated

finding aids at the moment of their creation. This approach can be seen as an

implementation of the idea proposed by Pitti (2004, 2006) about the combined use

of EAD and EAC-CPF to build an integrated system giving access to descriptive

records from different archives.
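To make the topic map representation more concrete, the sketch below serializes two topics, one occurrence, and one association with Python's standard library. It assumes XTM 2.0 syntax (the paper does not state which XTM version IMR uses), and the topic identifiers, the typing topics #finding-aid, #co-occurrence, and #member, and all URLs are hypothetical. The typing topics are referenced but not defined in this fragment.

```python
import xml.etree.ElementTree as ET

XTM_NS = "http://www.topicmaps.org/xtm/"  # XTM 2.0 namespace (assumed version)
ET.register_namespace("", XTM_NS)


def q(tag):
    """Qualify a tag name with the XTM namespace."""
    return f"{{{XTM_NS}}}{tag}"


def typed(parent, tag, type_ref):
    """Append <tag> carrying a <type><topicRef/></type> child and return it."""
    element = ET.SubElement(parent, q(tag))
    type_element = ET.SubElement(element, q("type"))
    ET.SubElement(type_element, q("topicRef"), href=type_ref)
    return element


def add_topic(topic_map, topic_id, authority_uri, label, finding_aid_urls=()):
    """Add a topic identified by its authority URI, with finding aids as occurrences."""
    topic = ET.SubElement(topic_map, q("topic"), id=topic_id)
    ET.SubElement(topic, q("subjectIdentifier"), href=authority_uri)
    name = ET.SubElement(topic, q("name"))
    ET.SubElement(name, q("value")).text = label
    for url in finding_aid_urls:
        occurrence = typed(topic, "occurrence", "#finding-aid")
        ET.SubElement(occurrence, q("resourceRef"), href=url)
    return topic


def add_association(topic_map, topic_ids, type_ref="#co-occurrence"):
    """Relate topics that were used as access points in the same finding aid."""
    association = typed(topic_map, "association", type_ref)
    for topic_id in topic_ids:
        role = typed(association, "role", "#member")
        ET.SubElement(role, q("topicRef"), href=f"#{topic_id}")
    return association


topic_map = ET.Element(q("topicMap"), version="2.0")
add_topic(topic_map, "person-1", "https://authorities.example.org/eac/person-1",
          "Example engineer", ["https://archive-a.example.org/fa/1"])
add_topic(topic_map, "corp-7", "https://authorities.example.org/eac/corp-7",
          "Example company")
add_association(topic_map, ["person-1", "corp-7"])
xtm_text = ET.tostring(topic_map, encoding="unicode")
```

Using the authority URI as the subject identifier is what would let topics harvested from different archives be recognized as the same entity.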

While the LWE component supports sharing context and authority records and thesauri between catalogers, IMR shares with end users and researchers the results of the collective indexing activity completed by the network of archives. Searching and browsing are not restricted to a single repository, but extend to the whole aggregated set of access points assigned to all the finding aids published by all the archives. When users identify a relevant piece of information in the IMR, they are redirected to the site hosting the whole descriptive record. IMR offers a single point of access

to the finding aids and works as a gateway to the contents managed by the archives.

The aggregation and processing of metadata surrogates not only feeds a common,

centralized database (in this case, in the form of an XML topic map), but also infers

the relationships between the entities and subjects used as access points in the

aggregated metadata records and incorporates them in the topic map.
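The consolidation half of this step can be sketched in Python as a merge keyed by authority URI. This is an in-memory approximation under stated assumptions, not the XTM repository itself: the Topic class, the record layout, the consolidate function, and all URLs are hypothetical. Only topics and occurrences are covered here.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """One entity or subject, identified by the URI of its authority record."""
    uri: str
    names: set = field(default_factory=set)
    occurrences: set = field(default_factory=set)  # URLs of finding aids


def consolidate(topic_map, harvested_record):
    """Merge one harvested record into the topic map and return the URIs of new topics."""
    created = []
    for uri, label in harvested_record["access_points"]:
        topic = topic_map.get(uri)
        if topic is None:
            # Unknown URI: the entity is not yet part of the topic map.
            topic = topic_map[uri] = Topic(uri)
            created.append(uri)
        # Known URI: the new data are merged with the existing topic.
        topic.names.add(label)
        topic.occurrences.add(harvested_record["finding_aid_url"])
    return created


WAGNER = "https://authorities.example.org/eac/joseph-wagner"
EGGER = "https://authorities.example.org/eac/franz-egger"

topic_map = {}
consolidate(topic_map, {
    "finding_aid_url": "https://archive-a.example.org/fa/1",
    "access_points": [(WAGNER, "Wagner, Joseph")],
})
created = consolidate(topic_map, {
    "finding_aid_url": "https://archive-b.example.org/fa/9",
    "access_points": [(WAGNER, "Joseph Wagner"), (EGGER, "Franz Egger")],
})
```

Keying the store by URI rather than by name means that variant forms of a name recorded by different archives end up on a single topic, provided both archives took the access point from the same shared authority record.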

The following sections describe the specific functions and technical details of the

implementation and the proposed use of semantic Web standards.

Functions supported by LWE-IMR

LWE users are archivists and end users who browse and search the HTML

publications generated for their local collections. Archivists create finding aids and

descriptions and publish them on their local Web sites.

LWE supports these major functions:

• Creation of XML-based finding aids and descriptions. When editing descriptions, archivists can search authority records hosted in remote repositories and assign them as access points to their descriptions. The chosen values are added to the appropriate elements in the description, together with the Uniform Resource Identifier (URI) assigned to the authority record or SKOS concept in the remote repository. In the case of EAD, the elements whose values can be linked to EAC-CPF or SKOS records are <origination> and the child elements within <controlaccess>. SKOS entries are always linked to <subject> elements, and the URI of the concept in the SKOS vocabulary is stored in its @source attribute. EAC-CPF records can be linked to the EAD elements <origination>, <persname>, <corpname> or <famname>. The URI of the EAC-CPF record and the identifier of the EAC-CPF repository are added to the attributes @source and @authfilenumber (in the case of <origination>, the element <name> is used to duplicate the access point).

• The interaction between the XML editors and the EAC-CPF and SKOS

databases is implemented by the exchange of SRU messages and lightweight

REST Web services. When searching the EAC-CPF and SKOS repositories,

archivists can request the display of the entire EAC-CPF or SKOS records, to be

certain they are choosing the right access points.

• Creation of local, Web publications. This includes the generation of HTML and

PDF pages for the finding aids and descriptive records, and indexes to browse

the finding aids by person names, corporate bodies, dates, subjects, and

geographical locations.

• Generate RDF data to be harvested by the IMR. The LWE creates separate RDF

documents for each finding aid or description. These RDF documents contain a

subset of the metadata. In the case of EAD, the elements whose values are

exported to RDF are \eadid[, \repository[, \unitid[, \unittitle[, \scope-

content[, \origination[, \unitdate[, and all the elements within \controlac-

cess[. For those elements whose values are taken from an authority list or

controlled vocabulary, the sources and identifiers of the referred authority

records are also stored in the RDF (Fig. 1).
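The export step described in the last bullet can be sketched as follows. This is a minimal illustration in Python, not the LWE plug-in itself (which the paper describes as Visual Basic plus XSLT). The sample EAD fragment, the example.org URIs, and the function names `text_of` and `extract_subset` are all hypothetical, and the sketch assumes, following the text above, that the authority URI sits in @source and the repository identifier in @authfilenumber.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified EAD fragment (no namespace, no DTD).
SAMPLE_EAD = """<ead>
  <eadheader><eadid>ES-EX-0001</eadid></eadheader>
  <archdesc level="fonds">
    <did>
      <repository>Example Engineering Archive</repository>
      <unitid>EX/001</unitid>
      <unittitle>Example fonds</unittitle>
      <unitdate>1920-1950</unitdate>
      <origination>
        <persname source="http://authority.example.org/eac/p001"
                  authfilenumber="EXAMPLE-EAC">Wagner, Joseph</persname>
      </origination>
    </did>
    <scopecontent><p>Drawings and correspondence.</p></scopecontent>
    <controlaccess>
      <subject source="http://vocab.example.org/skos/c042">Bridges</subject>
    </controlaccess>
  </archdesc>
</ead>"""

# Elements copied as plain text into the exported subset.
SIMPLE_FIELDS = ("eadid", "repository", "unitid", "unittitle",
                 "unitdate", "scopecontent")


def text_of(element):
    """Collapse all text inside an element into one normalised string."""
    return " ".join("".join(element.itertext()).split())


def extract_subset(ead_xml):
    """Return the metadata subset of one finding aid as a dictionary."""
    root = ET.fromstring(ead_xml)
    record = {}
    for name in SIMPLE_FIELDS:
        element = root.find(f".//{name}")
        if element is not None:
            record[name] = text_of(element)
    # Access points: direct children of <origination> and <controlaccess>,
    # kept together with the authority attributes assigned by the archivist.
    record["access_points"] = [
        {
            "element": child.tag,
            "label": text_of(child),
            "source": child.get("source"),
            "authfilenumber": child.get("authfilenumber"),
        }
        for container_name in ("origination", "controlaccess")
        for container in root.iter(container_name)
        for child in container
    ]
    return record


record = extract_subset(SAMPLE_EAD)
```

The resulting dictionary is only an intermediate form; how it is serialized to RDF depends on the vocabulary chosen by the project.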

The functions of the IMR component include (a) harvesting of RDF metadata,

(b) integration and consolidation of harvested metadata into the XTM topic map,

and (c) generating the Web publication for end users. The second and third

functions are the most relevant from the point of view of this research.

• Integration and consolidation of metadata. To integrate metadata from the

harvested RDF files, IMR extracts the elements of the RDF files that correspond

to persons, corporate bodies, subjects, locations, etc., their source vocabulary

and URI, and checks whether the topic map already includes topics with that

URI. If this is the case, the data extracted from the RDF file are merged with the

data existing in the topic map. If the entity referenced in the RDF file is not part

of the topic map, a new topic is added. The finding aid is added as an occurrence

to the topic map, identified by its URL. It is also used as the context in

which the relationship between related entities or subjects has been identified.

For example, if the topic map already contains a topic for the person Joseph

Wagner, and a new RDF file is processed that has the access points Joseph

Wagner and Franz Egger, three items are added to the topic map: one topic for

Franz Egger, one occurrence for the processed finding aid, and one relationship between

the topics for Joseph Wagner and Franz Egger; the context for this relationship

points to the finding aid where the relationship has been identified.

Fig. 1 Finding aid for the Torroja Miret fonds

• IMR not only aggregates descriptions, but at the same time builds a complex

network of inferred relationships between the entities and subjects that can be

later used to find related data and records and to explore the aggregated

collection.

• Generation of the Web publication. The creation of the HTML pages to browse

the topic map is executed by means of XSLT style sheets. Separate HTML pages

are created for each topic in the topic map, showing a brief description of the

entity and links to related topics grouped by type. Relationships stored in the

topic map are of the following types (the entities that are linked are used as

descriptive names for the different types of relationships):

• Archive—Description unit

• Engineering work—Description unit

• Engineering work—Engineer

• Engineering work—Architect

• Description unit—Corporation/Institution

• Description unit—Place/Location

• Description unit—Subject (from an engineering thesaurus)

• Engineering work—Subject

• Engineer—Engineer—Engineering work

• Engineer—Corporation/Institution—Engineering work
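The integration step described above (merge by URI, add the finding aid as an occurrence, record the finding aid as the context of each inferred relationship) can be sketched in a few lines. This is an illustrative Python model under simplifying assumptions, not the Visual Basic implementation: relationship types are omitted, every pair of co-occurring access points is treated as related, and all URIs and class names are hypothetical.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Topic:
    uri: str
    names: set = field(default_factory=set)        # labels seen for this URI
    occurrences: set = field(default_factory=set)  # finding-aid URLs


@dataclass
class TopicMap:
    topics: dict = field(default_factory=dict)        # URI -> Topic
    # frozenset({uri_a, uri_b}) -> set of finding-aid URLs (the scope)
    associations: dict = field(default_factory=dict)

    def merge_record(self, finding_aid_url, access_points):
        """Merge one harvested record; access_points is a list of (uri, label)."""
        for uri, label in access_points:
            # Reuse the topic if its URI is already known, else create it.
            topic = self.topics.setdefault(uri, Topic(uri))
            topic.names.add(label)
            topic.occurrences.add(finding_aid_url)
        # Access points that co-occur in a finding aid are related, and the
        # finding aid is stored as the context of that relationship.
        uris = sorted({uri for uri, _ in access_points})
        for pair in combinations(uris, 2):
            self.associations.setdefault(frozenset(pair), set()).add(finding_aid_url)


WAGNER = "http://authority.example.org/eac/wagner"  # hypothetical URIs
EGGER = "http://authority.example.org/eac/egger"

topic_map = TopicMap()
topic_map.merge_record("http://archive-a.example.org/fa/1",
                       [(WAGNER, "Joseph Wagner")])
topic_map.merge_record("http://archive-b.example.org/fa/7",
                       [(WAGNER, "Wagner, Joseph"), (EGGER, "Franz Egger")])
```

After the second call the map holds two topics, the existing Wagner topic has gained a second name and occurrence, and one association between Wagner and Egger is scoped by the second finding aid.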

IMR implements a single data flow for metadata harvesting via HTTP using the

XMLHttpRequest standard library. The interaction between the IMR and the Web

sites holding the whole EAD files is not considered a data flow, as IMR just

redirects the user to the site where the finding aid is available. The software

components that make up the IMR are the RDF harvester, the RDF processor, and

the publishing utility. All these components are developed in Visual Basic and

XSLT.

Semantic data standards to support collaboration: RDF and XTM

A collaboration framework to support the creation and publication of descriptions

must support widely adopted community standards and standards defined for the

target community of users. This means that support for EAD and EAC-CPF is a

basic requirement. But the possibilities that other standards may bring to archivists

and end users must be considered. As a result of this analysis, the proposed

architecture supports complementary metadata schemas for the

description of information resources, such as Dublin Core, MODS, and VRA Core.

From a technical perspective, the searching of access points encoded as EAC-CPF

and SKOS records in remote repositories has been implemented using the Search/

Retrieve URL (SRU) standard protocol managed by OASIS. This integration


permits the assignment of access points to different types of descriptive XML

metadata records, regardless of the schema on which they are based.
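As an illustration of the SRU interaction, the sketch below builds a searchRetrieve request URL. The parameter names (operation, version, query, maximumRecords, recordSchema) come from the SRU protocol; the base URL, the record schema identifier, and the function name `build_sru_url` are assumptions made for the example, and the query is a bare CQL term so that no server-specific index names are implied.

```python
from urllib.parse import urlencode


def build_sru_url(base_url, cql_query, max_records=10, record_schema=None):
    """Return an SRU searchRetrieve URL for the given CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": cql_query,
        "maximumRecords": str(max_records),
    }
    if record_schema:
        # Schema identifiers are defined by each server; this one is made up.
        params["recordSchema"] = record_schema
    return f"{base_url}?{urlencode(params)}"


url = build_sru_url(
    "http://authority.example.org/sru",  # hypothetical EAC-CPF repository
    '"Wagner, Joseph"',
    max_records=5,
    record_schema="eac-cpf",
)
```

The response to such a request is an XML document with the matching records, from which an editor plug-in could read the label and URI to store in the description.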

The proposed architecture also makes use of other standards usually overlooked

by the archival community: RDF and XTM. RDF is used for transferring semantic

information during data aggregation. Once the descriptions are completed, they can

be automatically harvested as RDF records containing a subset of the whole

metadata record or finding aid: title, creator, dates, abstract, and keywords

corresponding to subjects, persons, corporate bodies, and geographic locations. As

the values for some of these elements were initially taken from controlled

vocabularies and authority files, the RDF record includes both the human-readable

label and the URI that uniquely identifies concepts and entities. URIs are needed to

ensure the integration and merging of the harvested metadata into the existing topic

map: as different archives can provide metadata about the same entity or subject,

URIs are the mechanism used to know whether different metadata refer to the same

entity.

The RDF file contains separate <rdf:Description> elements for the information

resource itself and for the entities (persons, corporate bodies, geographic locations,

etc.) and subjects assigned as access points. <rdf:Description> works as an

envelope for the metadata for the referred entity or subject. The <rdf:Description>

element that corresponds to the finding aid contains references to the entities and

subjects within the same RDF file to ensure the correct interpretation and

consistency of the metadata. Figure 2 shows a sample RDF file generated

for a specific finding aid.
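The structure just described, one <rdf:Description> for the finding aid plus one per entity or subject, can be sketched with the Python standard library. The property vocabulary used here (Dublin Core elements and rdfs:label) is an assumption chosen for illustration, as are the example.org URIs and the function name `build_rdf`; the properties in the project's own RDF files may differ.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS_NS = "http://www.w3.org/2000/01/rdf-schema#"
DC_NS = "http://purl.org/dc/elements/1.1/"  # vocabulary choice is an assumption

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("rdfs", RDFS_NS)
ET.register_namespace("dc", DC_NS)


def build_rdf(finding_aid_url, title, access_points):
    """Serialize one finding aid; access_points is a list of
    (dc_property, authority_uri, label) tuples."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    # Envelope for the information resource itself.
    resource = ET.SubElement(root, f"{{{RDF_NS}}}Description",
                             {f"{{{RDF_NS}}}about": finding_aid_url})
    ET.SubElement(resource, f"{{{DC_NS}}}title").text = title
    for prop, uri, label in access_points:
        # Reference from the finding aid to the entity or subject ...
        ET.SubElement(resource, f"{{{DC_NS}}}{prop}",
                      {f"{{{RDF_NS}}}resource": uri})
        # ... and a separate envelope carrying its human-readable label.
        entity = ET.SubElement(root, f"{{{RDF_NS}}}Description",
                               {f"{{{RDF_NS}}}about": uri})
        ET.SubElement(entity, f"{{{RDFS_NS}}}label").text = label
    return ET.tostring(root, encoding="unicode")


rdf_xml = build_rdf(
    "http://archive.example.org/fa/1",
    "Example fonds",
    [("creator", "http://authority.example.org/eac/p001", "Wagner, Joseph"),
     ("subject", "http://vocab.example.org/skos/c042", "Bridges")],
)
```

Keeping both the URI and the label in the same file lets the harvester merge on the URI while still having a display name available.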

RDF records created from finding aids can be regularly harvested by the IMR using

HTTP requests and responses. Once harvested, the RDF files are processed to extract

metadata and merge them into the XML topic map, stored according to the XTM

specification. The use of XTM for data serialization has proved useful for

improving access to information. Topic maps mediate between data repositories and end

users in the same way that a subject heading list mediates in libraries between

the bibliographic catalog and the library users. Topic maps were created in the 1990s

to enable the exchange of the indexes added to technical books and user guides. This

idea was later applied to the collections of digital documents. The topic map

specification was published as an international standard ISO 13250 in 1999 within the

Document Description and Processing Languages group and updated in 2003.

Initially, topic maps were based on SGML (Standard Generalized Markup Language),

but these schemas were later migrated to XML with the creation of XTM (XML topic

maps). The international standard ISO 13250 is made up of different parts: part 2

provides the data model, part 3 the XML syntax.

Topic maps use the term subject to refer to anything whatsoever, regardless of

whether it exists or has any other specific characteristics, about which anything

whatsoever may be asserted by any means whatsoever. Subjects are uniquely

identified by subject identifiers and subject locators. Subject identifiers refer to the

subjects of the statements, and subject locators correspond to specific information

resources. The latter will be used when the information is provided for an

information resource and not for the entity represented by it. For example, if we use

the URL http://www.uc3m.es as a subject locator, the information for this topic in


the topic map shall refer to the Web site of the institution; if we use the same URL

as a subject identifier, the information shall refer to the institution represented by

this URL. Topics refer to the representation of the subjects in the topic map: symbols

used within a topic map to represent one, and only one, subject, in order to allow

statements to be made about the subject. The difference between subjects and topics

is similar to the difference between concepts and designations, where a concept is a

unit that encompasses the common characteristics of some objects, and a

designation is the word or words used to refer to that concept in a specific language.
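The distinction between subject identifiers and subject locators matters when topics are merged. The sketch below uses the http://www.uc3m.es example from the text; the `Topic` class and the `same_subject` function are hypothetical names, and the rule they encode (topics match when they share an identifier or share a locator, but an identifier never matches a locator) is my reading of the topic maps data model rather than something stated in this paper.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    name: str
    subject_identifiers: set = field(default_factory=set)
    subject_locators: set = field(default_factory=set)


def same_subject(a, b):
    """True when two topics share a subject identifier or a subject locator.

    The same URL used once as identifier and once as locator does not match,
    because it denotes two different subjects.
    """
    return bool(
        (a.subject_identifiers & b.subject_identifiers)
        or (a.subject_locators & b.subject_locators)
    )


URL = "http://www.uc3m.es"

# The institution represented by the URL.
university = Topic("Universidad Carlos III de Madrid", subject_identifiers={URL})
# The Web site itself, as an information resource.
website = Topic("Web site of the university", subject_locators={URL})
# A second description of the institution coming from another source.
university_again = Topic("UC3M", subject_identifiers={URL})
```

Here `same_subject(university, university_again)` holds, so the two would be merged, while `same_subject(university, website)` does not.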

In practice, topic maps consist of indexes created for a set of resources. The topic

map is made of (a) a set of topics that index the content of the resources; (b) the

relationships between topics, and (c) the information resources indexed by those

topics. The relationship between one resource and one topic means that that

resource provides information about that topic. Advantages of topic maps as

indexing and retrieval tools go beyond those provided by thesauri or subject

headings, as in topic maps it is possible to categorize the relationships between

topics. The category of the relationships is not restricted in advance, as with subject

headings or thesauri, and indexers can add different types of relationships between

topics. Having no constraints on the types of relationships between topics,

information professionals may summarize with a greater precision the knowledge

embedded in the documents that make up the collection. Topic maps also allow the

creation of relationships between topics, even if these relationships are not explicitly

or implicitly defined in the indexed resources.

Fig. 2 Sample RDF record generated for harvesting


Topic map characteristics are also summarized in three concepts: names,

occurrences, and roles. Names refer to topics using a string of characters, and

several names can be assigned to the same topic, solving the issues related to

synonyms, preferred, and non-preferred terms or names in different languages.

Occurrences are the documents or information resources providing information

about specific topics. Roles represent the role of the topics linked by an association.

Associations can be reified to become the topic of subsequent declarations, and

the topic map’s model can be easily complemented with ontologies to identify valid

sets of names, associations, and roles.

Another interesting feature of topic maps is the scope of topics and associations.

The scope represents the context in which the information provided in the topic map

is valid. The implementation of the topic maps described in this paper makes use of

the scope to indicate in which information resources the relationships between

persons, entities, and archives have been documented.

The use of topic maps to facilitate retrieval of different types of data has been

analyzed by different authors (Yi 2008; Tramullas and Garrido 2006), but this

technology has not been widely adopted by archives. Most references describing the

use of topic maps refer to initiatives to give unified access to heterogeneous

collections of materials and metadata (Schweiger et al. 2003; Venkatesh et al. 2007;

Shien-Chiang 2008). In summary, the advantages of using topic maps as an

indexing tool are similar to those provided by controlled vocabularies such as thesauri

and lists of subject headings, and include the following:

• The use of descriptors to represent the content of the data items helps improve

the relevance of the retrieval process with respect to the results that may be

obtained using free text searching.

• The relationships between topics allow users to explore the conceptual network

built by the data items and locate other topics that may be relevant to their

search and can be added as new search criteria.

• Information professionals can describe the content of the information resources

with greater precision.

• New topics and relationships can be created, even if those topics and

relationships are not explicitly stated in the information resource being indexed.

In the case of topic maps, there is no restriction on the type of relationships

that may be created between topics; any relationships can be made explicit, and

they are not limited to the classical relationships used in controlled vocabularies

(equivalent, broader, or narrower terms).

In this project, the topic map is generated from the RDF files. It provides separate

topics for the persons, companies, institutions, places, engineering works, thematic

subjects, archives holding, and description units identified in the RDF files. It also

specifies the relationships that can be inferred between topics. References to the

indexed documents are treated as occurrences that provide information about these

entities, following the topic map philosophy. The topic map highlights the

relationships between the different entities involved in the creation and custody of

the indexed materials. Both the entities and the relationships are categorized as

requested in the topic maps specification.


In addition to the information about entities and occurrences, the topic map also

records the different relationships between entities and the context in which these

relationships are identified. For example, if an engineer has worked for a specific

company, this relationship is included in the topic map by means of a specific

<association> element. This element is used to record the relationship between

entities, as well as the type of the relationship. The context of this relationship

corresponds to the finding aids where this relationship is documented, and it is

stored in the <scope> element of the topic map. The <scope> element is used to

indicate "in which context" a relationship between entities exists. In this project, the

<scope> element always includes a reference to the finding aid or descriptive

record where the access points co-occur. Where different finding aids provide

evidence of the same relationship between a pair of entities, the <scope> element is

repeated as many times as needed. This use of the <scope> element is aligned with

its initial purpose in the XTM specification (Fig. 3).
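A scoped association of this kind can be sketched as follows. The element names (association, type, scope, role, topicRef) are loosely modelled on the XTM 2.0 syntax, but the output is not validated against the schema and may differ from the markup used in the project's file. In particular, this sketch lists one reference per finding aid inside a single scope element, which is an assumption. The topic references and the function name `association_element` are hypothetical.

```python
import xml.etree.ElementTree as ET


def association_element(assoc_type, members, finding_aids):
    """Build an association between topics, scoped by finding aids.

    members:      list of (role_type_ref, topic_ref) pairs
    finding_aids: references to the finding aids documenting the relationship
    """
    assoc = ET.Element("association")
    # Type of the relationship, e.g. engineer / company.
    ET.SubElement(ET.SubElement(assoc, "type"), "topicRef", href=assoc_type)
    # Context: every finding aid where the access points co-occur.
    scope = ET.SubElement(assoc, "scope")
    for finding_aid in finding_aids:
        ET.SubElement(scope, "topicRef", href=finding_aid)
    # One role per entity taking part in the relationship.
    for role_type, topic in members:
        role = ET.SubElement(assoc, "role")
        ET.SubElement(ET.SubElement(role, "type"), "topicRef", href=role_type)
        ET.SubElement(role, "topicRef", href=topic)
    return assoc


element = association_element(
    "#works-for",
    [("#employee", "#wagner"), ("#employer", "#company-x")],
    ["#finding-aid-1", "#finding-aid-7"],
)
xtm_fragment = ET.tostring(element, encoding="unicode")
```

Because the context is stored with the association, a browsing interface can show not only that two entities are related but also which descriptions provide the evidence.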

The processing of the RDF files to generate and update the contents of the XTM

file is done by means of a Visual Basic program. This software completes different

steps: it checks whether the entity being processed already exists in the XTM file

by means of its URI, it checks whether the relationship with the other topic

already exists, etc. These controls ensure the integrity of the data in the topic map

and avoid the creation of duplicated entries and relationships. The resulting XTM

file is the actual metadata registry, a single, large XML database containing all the

data needed to retrieve and access the full set of distributed finding aids and

record representations. It is stored in a native XML database, Berkeley DB XML,

that provides users with full-text and qualified searching facilities. This is

combined with an exploratory approach based on browsing the relationships

between topics.

Fig. 3 Fragment of the XTM file

The process and tools used to generate the topic map are described in Fig. 4.

Fig. 4 Processing of source file to generate the XTM file

Regarding the tools supporting this process, archivists creating descriptions use

an XML editor to create the finding aids and descriptive records and link them to

remote EAC-CPF files. A plug-in developed in Visual Basic is provided to generate

an RDF file for each finding aid by applying an XSLT stylesheet. These RDF files are

made available in a specific folder where they can be harvested by the IMR

component. IMR is the server component in charge of harvesting and aggregating

the metadata. IMR includes (a) a harvesting component developed as a Windows

service in Visual Basic, (b) a Visual Basic program that processes the harvested files

and merges their data into the existing topic map, and (c) a third component also

written in Visual Basic that applies two XSLT transformations to generate the

HTML pages that compose the user interface for browsing the aggregated data set.

IMR administrators can execute the publication generation process to create the

navigation layer and generate the Web-based interface. The creation of the HTML

pages requires an intermediate step to generate separate XML files for each topic

and its relationships. The following file is an example of these intermediate XML

files (Fig. 5).

Fig. 5 Intermediate XML file generated for XTM publication

These intermediate files are converted to HTML by means of a second XSLT

stylesheet. The resulting files contain all the hypertext links that enable users to navigate

across the topic map. Figure 6 shows an example of an HTML page

generated for one topic corresponding to an engineer. The menu at the right-hand

side of the page shows the different types of links (relationship types) leading to a

page with the other entities and subjects (topics) related to this topic.
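The idea of one page per topic, with links grouped by relationship type, can be illustrated without XSLT. The following sketch swaps in plain Python string templating for the XSLT stylesheets used in the project; the page layout, file names, sample data, and the function name `render_topic_page` are all hypothetical.

```python
from collections import defaultdict
from html import escape


def render_topic_page(topic_name, related):
    """Return an HTML fragment for one topic.

    related: list of (relationship_type, related_topic_name, href) tuples
    """
    groups = defaultdict(list)
    for relationship_type, name, href in related:
        groups[relationship_type].append((name, href))

    parts = [f"<h1>{escape(topic_name)}</h1>"]
    # One section per relationship type, each with links to related topics.
    for relationship_type in sorted(groups):
        parts.append(f"<h2>{escape(relationship_type)}</h2>")
        parts.append("<ul>")
        for name, href in sorted(groups[relationship_type]):
            parts.append(
                f'<li><a href="{escape(href, quote=True)}">{escape(name)}</a></li>'
            )
        parts.append("</ul>")
    return "\n".join(parts)


page = render_topic_page(
    "Wagner, Joseph",
    [("Engineering work", "Example bridge", "topic-12.html"),
     ("Corporation/Institution", "Example & Co.", "topic-30.html"),
     ("Engineering work", "Example dam", "topic-14.html")],
)
```

Grouping by relationship type is what produces the menu of link types described above; each group leads to the topics related in that way.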

Benefits of the proposed architecture

The benefits of the proposed architecture include (a) independence of the archives,

(b) scalability and availability, (c) sharing context and authority records and

thesauri, and (d) economy.

Independence of the archives

Some initiatives for setting up a collaboration environment rely on the creation of a

central database where finding aids and descriptive metadata records are stored.

Archivists are offered some kind of Web-based interface to complete the finding aid

using forms, and the data are directly saved into the remote database. In these cases,

archives depend on the availability of the central database and may suffer delays

and problems related to limits in the number of concurrent users. In addition, the

opportunity to publish and share descriptions in other scenarios is restricted, as the

data are owned by the central repository. The LWE-IMR approach gives archivists

the choice of keeping their own repository and Web publication customized

according to their corporate images, fully integrated into their institutional Web

sites. In addition, risks related to the unavailability of the central database are

minimized.

Fig. 6 User interface for browsing aggregated metadata


Scalability and availability

The aggregation of metadata subsets is more reliable than other technical choices

based on the distribution of searches between repositories and the later consoli-

dation of search results into a single list. The latter is the approach followed by

Z39.50 implementations and metasearch engines. In scenarios based on the

distribution of searches, the performance of the whole system may be negatively

impacted by the performance of the worst subsystem. If one of the nodes produces a

delay in providing results to the metasearch engine, the search process as a whole is

delayed. For that reason, batch data aggregation approaches may be considered

better than metasearch, especially in the context of data with low volatility like

archival descriptions.
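The argument can be made concrete with a small calculation. In the sketch below all latencies are invented figures, and the model is deliberately simple: a distributed search that queries all nodes in parallel can only return the consolidated list once the slowest node has answered, whereas a search over pre-aggregated metadata depends on a single local lookup.

```python
def distributed_search_time(node_latencies):
    """Response time of a parallel distributed search: bounded by the slowest node."""
    return max(node_latencies)


def aggregated_search_time(local_lookup_latency):
    """Response time when the metadata was harvested in advance."""
    return local_lookup_latency


# Hypothetical response times in seconds for four remote repositories.
latencies = [0.2, 0.3, 0.25, 4.0]

slowest = distributed_search_time(latencies)   # 4.0: one slow node delays all
local = aggregated_search_time(0.3)            # 0.3: independent of the nodes
```

The cost of the aggregated approach is freshness, since results reflect the last harvest, which is acceptable for data that change as rarely as archival descriptions.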

Sharing context and authority records and thesauri

The sharing of EAC-CPF and SKOS records is accomplished at the LWE by means

of the implementation of the SRU profiles. Most of the open-source software

applications available today for managing finding aids give archivists the possibility

of managing local lists of descriptors and authority records. This approach is not

aligned with the current trend of open data that promotes making data available on

the Web for the community of users. With LWE, archivists can search remote

repositories of authority records and thesauri when creating the finding aids, and

EAC-CPF and SKOS records can be easily reused. This leverages the investments

and efforts made in the development of authority records and thesauri, as more and

more centers and projects can access and reuse them.

Finally, and this is the great benefit of the IMR, as the metadata in the finding

aids are linked to authority records and thesauri identified by global URIs, it is

possible to set up a global information discovery system based on the inferred

relationships between entities and concepts.

Economy

Archives wanting to participate in the proposed infrastructure just need to have

Internet access, an XML editor and some utilities, and plug-ins to support the

generation of RDF data and to search remote EAC-CPF and SKOS repositories with

SRU. No costly software or hardware is required. The technical

infrastructure is easy to maintain and does not require significant investment.

Conclusions

The proposed architecture demonstrates the feasibility of a distributed collaboration

environment based on existing descriptive and semantic metadata standards and

technical protocols. The architecture includes an LWE where archivists can create

finding aids or other descriptive metadata records and assign access points taken

from remote EAC-CPF and SKOS repositories. The SRU protocol is proposed to


interact with these remote repositories. The LWE must be deployed at each archive.

That is to say, there will be as many instances of the LWE as there are archives participating in

the initiative. On the other hand, the IMR component harvests and aggregates a

subset of metadata and processes them to infer relationships between access points.

The access layer exploits these semantically enriched relationships to improve the

end-user experience when browsing the aggregated metadata.

The implementation of the proposed architecture demonstrates its benefits for

archivists and end users. It makes collaboration between archives easier and

improves traditional methods used to search and access big collections of finding

aids. Users are given additional paths to explore the published items.

The architecture also demonstrates that EAD is a living standard offering great

opportunities to archivists and information professionals. EAD can be used not only

to encode archival finding aids, but also descriptions of other types of materials

(photographs, art works, etc.), and can be combined with other metadata. Relational

technologies do not seem to be capable of dealing with the complexity and nesting

levels of multilevel descriptions, and these constraints become even more evident

when it is necessary to manage, in a single repository, descriptive metadata

records based on different schemas: TEI, Dublin Core, MARCXML, METS, etc. In

these cases, the use of native XML is a technically feasible solution. The

publishing features incorporated into the architecture are also one relevant factor,

especially considering the conclusions of Yaco (2008), who identified a lack of

experience and knowledge about server technologies (necessary in order to publish

EAD finding aids on the Web) as one of the main obstacles for the adoption of this

standard.

LWE-IMR also proposes a solution to the integration of EAD records with

authority records. This has been a recurrent problem in most of the EAD

implementations identified in the literature. The links between descriptive metadata

and authority records have usually been implemented by means of proprietary

solutions that cannot really be reused between projects. The same situation arises

when considering the links between finding aids and terms from controlled

vocabularies. The proposed implementation of remote access through SRU solves

this issue.

Another hypothesis that is validated with the proposed architecture is the

possibility of applying the standards developed by the Semantic Web community

(RDF, SKOS) in archival practice. Up until now, these standards have had a minor

impact on the business practices of archivists, who consider the Semantic Web

standards as something out of the scope of their work. The literature includes a few

references to the use of Semantic Web standards in archives (Palacios Escalona

2006; Sanchez-Alonso et al. 2008) and popular open-source applications do not

support them (with the exception of ICA-AtoM, which can import and export

SKOS-encoded data). The implementation of the IMR demonstrates to what extent these

standards can be useful to design global archival information systems to discover

independently managed data, and to identify the relationships that exist between

scattered fonds and collections that may be related by provenance, subject or by any

other aspect. The IMR helps discover relationships that would otherwise

remain hidden to end users and researchers. The combined use of topic maps, XTM,

RDF, SKOS, and EAC-CPF sets an effective scenario to build a semantically rich

access layer for distributed records. The proposed usage of XTM demonstrates the

potential of this specification for metadata discovery. XTM goes beyond traditional

indexes, and the combination of full-text indexing with semantic-based browsing

capabilities offers an appropriate solution for navigating large information spaces.

EAC-CPF and SKOS solve in turn two of the main problems related to the use of

XTM: the need to manage the identity of the topics and the control of the

vocabulary. Topic maps need to standardize the names assigned to topics and

relationships. In this approach, identifiers for topics are taken from the URIs of

EAC-CPF and SKOS records. The possibilities that SKOS offers to establish

equivalency between concepts in different thesauri could also be used to enrich the

semantic access layer. In conclusion, the collaboration framework demonstrates

that support for these standards should not be viewed as additional features to current

information systems, but as the core features on which future archival information

systems should be based.

The LWE-IMR has proven to be a valid approach to ensure context-based

metadata aggregation and discovery in a network of distributed information centers.

Although the activity was initially planned for archives managing metadata encoded

in EAD and EAC-CPF, the technical infrastructure is fully compatible with other

metadata schemas and descriptive standards.

References

Bak G, Pam A (2008) Points of convergence: seamless long-term access to digital publications and

archival records at library and archives Canada. Arch Sci 8:279–293

Bountouri L, Gergatsoulis M (2009) Interoperability between archival and bibliographic metadata: an

EAD to MODS crosswalk. J Libr Metadata 9(1):98–133

Clavaud F, Sevigny M (2005) Controlling the production of EAD encoded documents, extracting

metadata and publishing them on the web: methods and tools France. J Arch Org 3(2/3):147–169

Cornish A (2004) Using a native XML database for encoded archival description search and retrieval. Inf

Technol Libr 23(4):181

Gilliland-Swetland AJ (1998) Evaluation design for large-scale, collaborative online archives: interim

report of the online archive of California evaluation project. Arch Mus Informatics 12(3/4):177–203

Hill A, Stockting B, Higgins S (2005) Different strokes for different folks: presenting EAD in three UK

online catalogues. J Arch Org 3(2/3):183–206

Huvila I (2008) Participatory archive: towards decentralised curation, radical user orientation, and

broader contextualisation of records management. Arch Sci 8:15–36

Imhof A (2008) Using International Standards to develop a union catalogue for archives in Germany:

Aspects to consider regarding interoperability between libraries and archives. D-Lib Mag 14 (9/10).

http://www.dlib.org/dlib/september08/imhof/09imhof.html. Accessed 7 Jan 2014

Kim H (2003) Myongji University digital library project: implementing a KORMARC/EAD integrated

system. TEL 21(4):367–374

Palacios Escalona JP (2006) Modelo de unificación semántica de ontologías, aplicado al dominio de los

archivos digitales. Doctoral dissertation, Universidad Politécnica de Madrid.

http://dialnet.unirioja.es/servlet/tesis?codigo=2674. Accessed 7 Jan 2014

Pitti DV (2004) Creator description: encoded archival context. Cataloging Classif Q 38(3):201–226

Pitti DV (2006) Technology and the transformation of archival description. J Arch Org 3(2):9–22

Sanchez-Alonso S, Sicilia MA, Rato G (2008) Sobre la interoperabilidad semántica en las descripciones

archivísticas digitales. Revista Esp Doc Cient 31(1):11–38

Schweiger R et al (2003) Linking clinical data using XML topic maps. Artif Intel Med 28:105


Shien-Chiang Y (2008) Discussion on web archives using topic maps. JoEMLS 46(1):55–80

Sigler L (2009) The changing world of archives. PNLA Q 73(4):36–44

Szary RV (2006) Encoded archival context (EAC) and archival description: rationale and background.

J Arch Org 3(2):217–227

Thurman A (2005) Metadata standards for archival control: an introduction to EAD and EAC. Cat Classif

Q 40(3):183–212

Tramullas J, Garrido P (2006) Constructing Web subject gateways using Dublin Core, the Resource

Description Framework and Topic Maps. Inf Res 11(2).

http://InformationR.net/ir/11-2/paper248.html. Accessed 7 Jan 2014

Venkatesh V et al (2007) Topic maps: adopting user-centred indexing technologies in course management

systems. J Interact Learning Res 18:429–450

Yaco S (2008) It’s complicated: barriers to EAD implementation. Am Arch 71(Fall/Winter 2008):

456–475

Yakel E, Kim J (2005) Adoption and diffusion of encoded archival description. J Am Soc Inf Sci Technol

56(13):1427–1437

Yi M (2008) Information organization and retrieval using a topic maps-based ontology: results of a task-

based evaluation. J Am Soc Inf Sci Technol 59(12):1898–1911

Dr. Ricardo Eito-Brun is an Associate Professor at Universidad Carlos III de Madrid, Spain, where he

teaches different subjects related to digital publishing, knowledge organization and representation,

classification, and information management. Ricardo holds a master's degree in Informatics from

Universidad Carlos III de Madrid and in Documentation and Information Science from the University of

Granada (Spain) and a doctoral degree from the University of Zaragoza (Spain) on the application of

distributed collaboration environments and Semantic Web techniques for the description and classifi-

cation of archival materials. His research interest is in information management practices, and he has been

responsible for several large-scale content management and Web-based publishing projects for companies

and public institutions in European countries. He is the author of four books on markup languages and

XML and numerous articles and conference papers in the field of information management.
