context-based aggregation of archival data: the role of authority records in the semantic landscape
TRANSCRIPT
ORI GIN AL PA PER
Context-based aggregation of archival data: the roleof authority records in the semantic landscape
Ricardo Eito-Brun
� Springer Science+Business Media Dordrecht 2014
Abstract The formal release of Encoded Archival Context for Corporate Bodies,
Persons, and Families (EAC-CPF) in 2010 added the need to deal with an additional
standard to encode context and authority records. But the possibilities of EAC-CPF
go beyond the control of authority records and access points, and this standard
constitutes a relevant milestone in the definition of a complex archival information
system made up of interconnected, cross-linked records. Based on eXtensible
Markup Language, EAC-CPF makes possible the design of semantically rich
browsing experiences that give access to distributed description of records and to
the detailed data of the persons, corporate bodies, or families that created them. This
paper presents a collaboration framework for archival information systems that
exploits the relationships built between finding aids and shared context and
authority records encoded in EAC-CPF. The proposed architecture is built on top of
a set of software components that interact using open information retrieval, content
aggregation, and semantic data standards. On top of this architecture, different user-
oriented solutions can be built to browse and explore the contents of the aggregated
collections. One of these applications is a navigational aid or topic map that serves
as a semantically enriched access layer and ensures the location of the records held
by different archives. The proposed architecture can be applied to solve different
information access challenges that require a single point of access to distributed
data. It can be deployed or mapped to existing technical architectures to improve the
interaction of users with a set of networked repositories.
Keywords XTM � RDF � Metadata aggregation � Engineering archives �Descriptive metadata standards � EAC-CPF
R. Eito-Brun (&)
Universidad Carlos III de Madrid, c/Madrid 124, 28030 Getafe, Madrid, Spain
e-mail: [email protected]
123
Arch Sci
DOI 10.1007/s10502-014-9215-3
Introduction
Archives have made progress in the deployment of Web-based services that provide
access to finding aids and digital representations of the records. The adoption of
tools, software, metadata schemas, and related working procedures has established
the basis for scenarios where archives can leverage cooperative efforts, build
collaborative networks, and figure out innovative information services for the end
users. The work of archivists cannot be understood in isolation, and the possibilities
of reaching descriptions and representations of records held by different archives
through the Web open a new space for researchers, no longer constrained by
geographical location of the records. The adoption of standards to encode, publish,
and share finding aids and context and authority records is one of the enablers of
new opportunities for collaboration. Encoded Archival Description (EAD) and
Encoded Archival Context for Corporate Bodies, Persons, and Families (EAC-CPF)
have been applied in different projects with the objective of sharing finding aids
created by different archives. But the possibilities offered by the combination of
these standards with those defined in the context of the Semantic Web require
further exploration. The progressive adoption of EAC-CPF since its publication in
2010 should not be seen just as a technical choice for encoding context and
authority records, but as an opportunity to figure out semantically enriched browsing
interfaces based on the aggregation of descriptive metadata that share Web-
accessible access points.
Up to this moment, most of the data aggregation initiatives developed in archives
either harvest finding aids from archives that publish them through the Web or the
finding aids are provided to an aggregator who will integrate them into a common
catalog. Metadata formats based on eXtensible Markup Language (XML) and
mature harvesting tools based on the OAI-PMH protocol give the opportunity of
building, with an affordable effort, collective databases that group together
descriptive records from different centers. These collective databases act as a single
point of access where end users can run their searches and be redirected to the
finding aids and to the digital representations of the records.
The aggregation process is usually restricted to EAD finding aids. Being the most
relevant tool to access data about archives’ holdings, finding aids provide detailed
descriptions of the records, their provenance, organization, as well as historical and
biographical details about the persons, organizations, or families that created the
fonds. Thurman (2005, p. 185) defined finding aids as ‘‘a single document that
places the materials in context by consolidating information about acquisition and
processing; provenance, including administrative history or biographical note; scope
of the collection, including size, subjects, media; organization and arrangement; and
an inventory of the series and folders.’’ But today’s archivists need to manage
additional resources that require different types of metadata, and the descriptive
tasks are not only guided by ISAD(G) or EAD: other schemas—not initially
designed by archivists and/or for archivists—need to be used. This is the case of
structural metadata like METS or metadata for describing cultural heritage objects
like VRA Core. Following the idea of Yakel and Kim (2005, p. 1427), it seems more
appropriate to use the term ‘‘access tools’’ instead of ‘‘finding aids’’ to refer to any
Arch Sci
123
kind of surrogate that describes and provides access to the items in an archive.
Although the need to combine different metadata schemas was identified by
different authors (Kim 2003; Cornish 2004; Clavaud and Sevigny 2005; Bak and
Pam 2008; Bountouri and Gergatsoulis 2009), the technical solutions available still
have limitations coping with multiple descriptive schemas. It is even possible to find
limited support to EAC-CPF or to the Simple Knowledge Organization System
(SKOS) format designed to deal, respectively, with context and authority records
and with controlled vocabularies.
The role of EAC-CPF in cooperation and aggregation efforts needs to be
evaluated to assess all its possibilities. With EAD, EAC-CPF is expected to be the
most relevant application of the XML language in archives. EAC-CPF establishes a
schema to create context and authority records compatible with the ISAAR(CPF)
ICA standard. It contains the elements and attributes needed to encode information
about the creators of the records and the circumstances that led to their creation.
EAC-CPF also incorporates elements to assign access points to be used for retrieval
purposes (Szary 2006). EAC-CPF records are created as separate files that can be
linked later to the EAD finding aids. Pitti (2004) identified the benefits of a standard
for describing creators, namely economic benefits, as authority records are costly to
create and maintain, and require a complex intellectual research process. Instead of
having different archives creating authority records for the same entities, authority
records could be made once and shared and completed by other centers. In addition,
EAC-CPF records have a high value as independent information resources, with
their value extending beyond archives, so they could be applied in other scenarios
for managing access to cultural heritage objects and bibliographical materials. Szary
(2006) indicated the need for recognizing the additional value of authority records,
as contextual information offers the basis for building a robust architecture for
archival information systems.
A collaboration framework for archives therefore must not only support archival
standards like EAD and EAC-CPF, but also different types of descriptive schemas
and, more importantly, try to leverage the advantages that context and authority
records and controlled vocabularies may offer to support the discovery of
information spread across multiple archives. This paper describes a technical
architecture for a distributed collaboration environment based on shared context-
related information. The proposed architecture leverages current aggregation
scenarios with the use of EAC-CPF and SKOS. Archives providing descriptive
records to the aggregation service may use as access points for their finding aids
values retrieved from Web-enabled repositories of authority records and thesauri.
Once harvested, the aggregation service can infer relationships between the entities
and subjects represented by the authority records from their co-occurrence in the
same finding aids. The inferred relationships are then used to build an access layer
on top of the aggregated set of descriptive records. End users can browse this access
layer to discover records and relationships that were not initially encoded in the
context and authority records, but explicitly stated in the whole set of aggregated,
indexed finding aids. A semantically enriched information access layer is created as
part of the aggregation process.
Arch Sci
123
The term ‘‘context-based aggregation’’ is used to characterize the proposed
collaboration environment. This term is selected for two reasons: (a) the access layer
is built on access points taken from shared authority lists, and (b) the access layer to
explore the aggregated collection infers relationships between access points from the
context provided in the finding aids where the access points co-occur. The descriptive
metadata records and finding aids provide the context where the relationships
between persons, corporate bodies, families, or subjects happen. The aggregation
process not only harvests and collects metadata records in a single information space,
but infers and gives sense to well-defined, semantic relationships and links between
the entities and topics used as access points for the archival materials.
Research hypotheses
The design of the proposed collaboration framework and its related technical
architecture was motivated by the constraints of the current aggregation initiatives.
This paper is the result of research undertaken in the context of archives of historical
engineering records. The research process was executed with the purpose of
validating these hypotheses:
• In the global context of the Web, archivists lack technical solutions to manage
descriptive metadata and authority records based on different schemas: EAD,
MADS, EAC-CPF, MARC, etc.
• The adoption of standards to encode finding aids, authority records, and
controlled vocabularies is a key requirement to establish a global archival
information system like the one proposed by Pitti (2004). Unless the difficulties
of linking finding aids with authority records and controlled vocabularies are
removed with economically feasible technical solutions, the definition of this
kind of global information system will be extremely complex.
• Collaborative environments for archives need to be built on top of the capability
of reusing context and authority records. The availability of tools with this
capability will promote the adoption of the standard. Unless these tools are
widely available, the risk of EAC-CPF being overlooked by the professional
community is high and needs to be taken seriously.
• Global, large-scale archival information systems must provide end users with
search capabilities beyond full-text and field-qualified searches. New approaches
for browsing and discovering records based on cross-linking between finding
aids, authority records, and controlled vocabularies will significantly improve
the user experience.
• Open-source software solutions developed for archives provide partial support to
XML-based standards like EAD, EAC-CPF, RDF, and SKOS. In most cases,
they are just supported as an import and export mechanism. This partial support
limits the opportunities for archivists to take advantage of these standards.
• Archivists are still reluctant to consider the benefits of adopting specifications
like Resource Description Framework (RDF) or XTM (XML topic maps) as part
of their information systems.
Arch Sci
123
The proposed architecture is designed in response to these hypotheses. The resulting
model and its technical implementation offer archivists and researchers a set of
software tools to create finding aids, link them to remote authority records, and to
build a semantic-based access layer on top of aggregated metadata and access points
based on EAC-CPF and SKOS records. This technical solution is the result of a
4-year collaboration project between CEHOPU (Centre for Historical Studies of
Public Works and Town Planning-CEDEX-Ministerio de Fomento, Spain) and the
research team. The initial scope of the project was the creation of separate Web sites
to make the finding aids for the personal fonds of Spanish civil engineers accessible.
Initial work focused on the tools to create and manage finding aids and reuse context
and authority records in an open environment. The analysis of the access methods
applied on these sites led to the design of an access layer where descriptive records
coming from different archives could be easily located. With this aim, a second
component was designed to collect descriptive metadata records and build an access
layer on top of the aggregated metadata; the relationships between the access points
were inferred and made explicit. This second component also generates Web-based
presentations to browse the relationships between access points and reach finding
aids and document representations.
Archival standards and data aggregation: an overview
Most of the approaches to create collective databases of finding aids described in the
professional literature rely on the use of the EAD standard. EAD was designed to
mitigate the restrictions on finding primary information sources due to the
geographical distribution of the collections (Cornish 2004). Sigler (2009) reviewed
how EAD impacted on the working methods of archivists: with EAD-encoded
finding aids, it would be possible to access detailed descriptions of records and
information resources available at archives located worldwide; fonds and collec-
tions related by provenance but geographically or administrative dispersed could be
virtually integrated on the Internet. This capability leads to a growing interest on
tools to publish EAD descriptions. EAD also contributed to remove the barriers
between archivists and other documentation professionals, traditionally justified by
the uniqueness of the archival records.
A review of the academic and professional literature reports different significant
projects that made use of EAD to create collective databases of finding aids.
Emblematic, pioneer projects include OAC (Online Archives of California), A2A,
and the Archives Hub among others.
Online Archives of California (OAC) was a collective database of finding aids and
digitized materials relevant to the history and culture of the United States (Gilliland-
Swetland 1998). As EAD was gaining stability and was known in the United States,
the University of California started the creation of a collective database from a set of
existing finding aids—with an extension of 30,000 pages—and the materials
digitized in the California Heritage Digital Image Access project (28,000 pictures
about the history and culture of California). The project started in 1996 with the name
Arch Sci
123
UC-EAD and was renamed 2 years later to OAC. At the end of 1998, OAC was
officially incorporated into the California Digital Library (CDL) database.
Cornish (2004) described the Northwest Digital Archives (NWDA) project that
had the participation of fifteen US-based institutions and created a centralized
repository of 2,300 finding aids.
Clavaud and Sevigny (2005) discussed four projects developed in France by the
Centre Historique des Archives Nationales (CHAN) to provide archivists with
methods and tools to control the quality of the finding aids and extract Dublin Core
metadata for their integration in other repositories through the OAI-PMH technical
protocol. The tools included PLEADE and Navimages. The first one was the result
of a private initiative of the companies AJLSM and Anaphore and included a feature
for searching the full text of the EAD documents and indexed elements. It was used
by the CHAN, Denver Public Library, Archives Nationales de France, and Centre
des Archives d’Outre-Mer (Aix-en-Provence, France). Navimages—supported by
the Direction des Archives de France—was a toolkit to organize, process, and
publish collection of digitized images. It was used in the Archives Canada–France
Project. Clavaud and Sevigny remarked that the possibility of using controlled
vocabularies and authority lists to fill specific elements and attributes was
considered, although at this time the number of controlled vocabularies and
indexing practices shared at the national level in France was too low to justify the
investment.
Hill et al. (2005) described three collaborative projects completed in the United
Kingdom: Archives Hub, Access to Archives (A2A), and Navigational Aids for the
History of Science, Technology and the Environment (NAHSTE). The different
ways to apply EAD in these three cases demonstrate the flexibility of the standard to
create, store, index, search, and display finding aids. Archives Hub started in 1999
with the aim of creating a collective online inventory of records from UK colleges
and universities; it collected a total of 18,000 finding aids for fonds and collections.
Archivists created EAD documents using Web-based templates, and the finding aids
were sent to the Archives Hub team that verified the documents and uploaded them
into the central repository. Archives could maintain local copies of their finding aids
and publish them in their Web sites. A harvesting process and a meta-index called
Spokes were developed later to support a distributed repository with centralized
indexes on top.
Access to Archives (A2A) was an online database that offered federated access to
the finding aids of around four hundred UK information centers, including museums
and libraries. The objective of A2A was the retrospective conversion of existing
finding aids into EAD to create the basis for a network of national archives. A2A
was interested in capturing detailed, multilevel finding aids, and around one million
pages were collected. Data providers did not need to create the EAD files: finding
aids on paper or electronic format were sent to a coordination group that converted
them into EAD using the RLG Best Practices Guidelines. As part of the conversion
process, EAD headers were mapped to Dublin Core using encoding analogs and the
names of persons, families, corporate bodies, and places in the elements
\origination[ and \controlaccess[ were controlled using the national authority
lists and the UNESCO thesaurus.
Arch Sci
123
The last project describe by Hill and colleagues is NAHSTE, an initiative of the
Edinburgh University Library, Glasgow University Archive Services, and Heriot-
Watt University Archive to catalog records and manuscripts relevant to the history
of science and technology. The project combined EAD with ISAAR(CPF), due to
the lack of maturity of the existing version of EAC at that time. An XML editor was
used to create the contents; the links between EAD files and authority records were
recorded in a separate database and in the @authfilenumber attribute of the
\persname[, \corpname[, \geogname[ and \subject[ elements. LCSH was
used for subjects and places, and headings were built using the UK National Council
on Archives Rules for the Construction of Personal, Place, and Corporate Names
(NCA Rules).
Imhof (2008) described an interesting project from the German Bundesarchiv to
build a portal for the records of the SED (German Socialist Party) and the FDGB
union available in five archives of the former German Democratic Republic. Imhof
remarked that with EAD, it was possible to create more detailed descriptions than
with ISAD(G). He identified the need to improve the methods applied to search and
display results, as the retrieval system not only needs to show the retrieved items,
but also the relationship between the data in the finding aid and its context.
Huvila (2008) used the term ‘‘participatory archive’’ to refer to a model
characterized by decentralized management and a radical orientation toward end
users. The activities developed as part of the Archival Excellence in Information
Seeking Studies Network (AX-SNet) project were based on this philosophy, trying
to involve end users in the descriptions of records using Web 2.0 technologies. This
approach is aligned with the idea of involving users in the activities of the archive.
From a metadata perspective, they opted for the use of Dublin Core and CIDOC-
CRM to describe the relationships between information resources.
In all these cases, it is possible to find two constants: (a) the use of EAD to
encode and share finding aids, and (b) the difficulties of using controlled
vocabularies and authority records, mainly due to the lack of maturity of the
EAC-CPF standard at the time of developing these initiatives.
To conclude this section, it is essential to include a reference to the Social
Networks and Archival Context (SNAC) project.1
It was launched in 2010 with the participation of the Library of Congress, OCLC
Research, Virginia Heritage, Getty Vocabulary Program, CDL, and Northwest
Digital Archive. Its purpose was the automatic extraction of EAC-CPF data from
the EAD files provided by the network of participants. The initiative goes beyond
the extraction of authority information, as extracted data are validated using
authority files like Library of Congress Name Authority File (LCNAF) and Getty
Vocabulary Program Union Lists of Artist’s Names (ULAN). At the time of writing
this paper, the SNAC has processed more than 20,000 finding aids and around
130,000 EAC-CPF files have been derived. A prototype is available online where
the generated authority files can be browsed and different tools and utilities have
been made available to the professional community, including an SPARQL end
point.
1 http://socialarchive.iath.virginia.edu.
Arch Sci
123
Architecture of the collaboration framework
The proposed collaboration framework is the result of the progressive development
of complementary components. The project started with the design of an EAD-
based tool for archives wanting to describe their collections and records and publish
finding aids on the Web. The architecture initially designed evolved to incorporate
aggregation capabilities and to provide end users with cross-browsing capabilities
based on semantically enriched links. As a result of this evolution, the framework is
made up of two independent components: (a) the component designed for the
creation and management of local descriptions and finding aids, or Local Work
Environment (LWE), and (b) the component designed to aggregate metadata
surrogates and create an access layer based on inferred relationships; it receives the
name of Integrated Metadata Register (IMR).
LWE is a tool for archivists. It is used to create XML-based finding aids and
metadata records based on any other XML schema. Its main component is an XML
editor extended with additional features. It provides the capability of searching
remote repositories or context and authority records (EAC-CPF) and SKOS
thesauri. Retrieved EAC-CPF and SKOS records can be incorporated into the
finding aids as access points. The interaction between the XML editor and the
remote EAC-CPF and SKOS repositories is implemented through a profile of the
SearchRetrieve (SRU) technical protocol, evolution of Z39.50. The use of SRU to
search and retrieve authority records is one of the most interesting features of the
proposed architecture, as it gives the choice of interacting—from different XML
editors and content creation tools—with remote repositories of EAC-CPF or SKOS
data.
With LWE, archivists can also generate Web publications in html and PDF
formats for their finding aids, and create indexes to browse the contents of the local
collections. The Web publications created with LWE are limited to the finding aids
held by the local archive: they do not allow browsing the full set of finding aids
created by the other centers involved in the collaboration environment.
The second component, IMR, works as a repository of aggregated metadata. IMR
compiles a subset of the metadata created by archivists working with their LWEs. It
does not store copies of the aggregated finding aids, but a subset of their metadata
and the links to reach the remote finding aids.
IMR also provides an access tool providing semantically enriched browsing
capabilities to end users. IMR is in fact an XML repository based on the XTM
schema (topic maps). It offers more than a traditional index of persons, corporate
bodies or places pointing to related finding aids. The IMR access layer displays
relationships between persons (engineers and architects in the current implemen-
tation), companies, public institutions, places, subjects, and engineering works.
These relationships are inferred from the access points assigned to the aggregated
finding aids at the moment of their creation. This approach can be seen as an
implementation of the idea proposed by Pitti (2004, 2006) about the combined use
of EAD and EAC-CPF to build an integrated system giving access to descriptive
records from different archives.
Arch Sci
123
If the LWE component supports sharing context and authority records and
thesauri between catalogers, IMR shares with end users and researchers the results
of the collective indexing activity completed by the network of archives. Searching
and browsing are not restricted to a single repository, but to the whole aggregated
set of access points assigned to all the finding aids published by all the archives.
When the user identifies a relevant piece of information in the IMR, he is redirected
to the site hosting the whole descriptive record. IMR offers a single point of access
to the finding aids and works as a gateway to the contents managed by the archives.
The aggregation and processing of metadata surrogates not only feeds a common,
centralized database (in this case, in the form of an XML topic map), but also infers
the relationships between the entities and subjects used as access points in the
aggregated metadata records and incorporates them in the topic map.
The following sections describe the specific functions and technical details of the
implementation and the proposed use of semantic Web standards.
Functions supported by LWE-IMR
LWE users are archivists and end users who browse and search the html
publications generated for their local collections. Archivists create finding aids and
descriptions and publish them on their local Web sites.
LWE supports these major functions:
• Creation of XML-based finding aids and descriptions. When editing descrip-
tions, archivists can search remote authority records hosted in remote
repositories and assign them as access points to their descriptions. The chosen
values shall be added to the appropriate elements in the description, as well as
the Uniform Resource Identifier (URI) assigned to the authority record or SKOS
concept in the remote repository. In the case of EAD, the elements whose values
can be linked to EAC-CPF or SKOS records are \origination[ and the child
elements within\controlaccess[. SKOS entries are always linked to\subject[elements, and the URI of the concept in the SKOS vocabulary is stored in its
@source attribute. EAC-CPF records can be linked to the EAD elements
\origination[, \persname[, \corpname[ or \famname[. The URI of the
EAC-CPF record and the identifier of the EAC-CPF repository are added to the
attributes @source and @authfilenumber (in the case of \origination[, the
element \name[ is used to duplicate the access point).
• The interaction between the XML editors and the EAC-CPF and SKOS
databases is implemented by the exchange of SRU messages and lightweight
REST Web services. When searching the EAC-CPF and SKOS repositories,
archivists can request the display of the entire EAC-CPF or SKOS records, to be
certain they are choosing the right access points.
• Creation of local, Web publications. This includes the generation of HTML and
PDF pages for the finding aids and descriptive records, and indexes to browse
the finding aids by person names, corporate bodies, dates, subjects, and
geographical locations.
Arch Sci
123
• Generate RDF data to be harvested by the IMR. The LWE creates separate RDF
documents for each finding aid or description. These RDF documents contain a
subset of the metadata. In the case of EAD, the elements whose values are
exported to RDF are \eadid[, \repository[, \unitid[, \unittitle[, \scope-
content[, \origination[, \unitdate[, and all the elements within \controlac-
cess[. For those elements whose values are taken from an authority list or
controlled vocabulary, the sources and identifiers of the referred authority
records are also stored in the RDF (Fig. 1).
The functions of the IMR component include (a) harvesting of RDF metadata,
(b) integration and consolidation of harvested metadata into the XTM topic map,
and (c) generating the Web publication for end users. The second and third
functions are the most relevant from the point of view of this research.
• Integration and consolidation of metadata. To integrate metadata from the
harvested RDF files, IMR extracts the elements of the RDF files that correspond
to persons, corporate bodies, subjects, locations, etc., their source vocabulary
and URI, and checks whether the topic map already includes topics with that
URI. If this is the case, the data extracted from the RDF file are merged with the
data existing in the topic map. If the entity referenced in the RDF file is not part
of the topic map, a new topic is added. The finding aid is added as an occurrence
to the topic map, identified by its URL. It will be also used as the context in
which the relationship between related entities or subjects has been identified.
For example, if the topic map already contains a topic for the person Joseph
Wagner, and a new RDF file is processed that has the access points Joseph
Fig. 1 Finding aid for the Torroja Miret fonds
Arch Sci
123
Wagner and Franz Egger, one topic for Franz Egger shall be added to the topic
map, one occurrence for the processed finding aid, and one relationship between
the topics for Joseph Wagner and Franz Egger; the context for this relationship
will point to the finding aid where the relationship has been identified.
• IMR not only aggregates descriptions, but at the same time builds a complex
network of inferred relationships between the entities and subjects that can be
later used to find related data and records and to explore the aggregated
collection.
• Generation of the Web publication. The creation of the HTML pages to browse
the topic map is executed by means of XSLT style sheets. Separate HTML pages
are created for each topic in the topic map, showing a brief description of the
entity and links to related topics grouped by type. Relationships stored in the
topic map are of the following types (the entities that are linked are used as
descriptive names for the different types of relationships):
• Archive—Description unit
• Engineering work—Description unit
• Engineering work—Engineer
• Engineering work—Architect
• Description unit—Corporation/Institution
• Description unit—Place/Location
• Description unit—Subject (from an engineering thesauri)
• Engineering work—Subject
• Engineer—Engineer—Engineering work
• Engineer—Corporation/Institution—Engineering work
IMR implements a single data flow for metadata harvesting via HTTP using the
XMLHttpRequest standard library. The interaction between the IMR and the Web
sites holding the whole EAD files is not considered a data flow, as IMR just
redirects the user to the site where the finding aid is available. The software
components that make up the IMR are the RDF harvester, the RDF processor, and
the publishing utility. All these components are developed in Visual Basic and
XSLT.
Semantic data standards to support collaboration: RDF and XTM
A collaboration framework to support the creation and publication of descriptions
must support widely adopted community standards and standards defined for the
target community of users. This means that support for EAD and EAC-CPF is a
basic requirement. But the possibilities that other standards may bring to archivists
and end users must be considered. As a result of this analysis, the proposed
architecture implemented support to complementary metadata schemas for the
description of information resources like Dublin Core, MODS, and VRA Core.
From a technical perspective, the searching of access points encoded as EAC-CPF
and SKOS records in remote repositories has been implemented using the Search/
Retrieve URL (SRU) standard protocol managed by OASIS. This integration
Arch Sci
123
permits the assignment of access points to different types of descriptive XML
metadata records, regardless of the schema on which they are based.
The proposed architecture also makes use of other standards usually overlooked
by the archival community: RDF and XTM. RDF is used for transferring semantic
information during data aggregation. Once the descriptions are completed, they can
be automatically harvested as RDF records containing a subset of the whole
metadata record or finding aid: title, creator, dates, abstract, and keywords
corresponding to subjects, persons, corporate bodies, and geographic locations. As
the values for some of these elements were initially taken from controlled
vocabularies and authority files, the RDF record includes both the human-readable
label and the URI that uniquely identifies concepts and entities. URI are needed to
ensure the integration and merging of the harvested metadata into the existing topic
map: as different archives can provide metadata about the same entity or subject,
URIs are the mechanism used to know whether different metadata refer to the same
entity.
The RDF file contains separate \rdf:Description[ elements for the information
resource itself and for the entities (persons, corporate bodies, geographic locations,
etc.) and subjects assigned as access points. \rdf:Description[ works as an
envelope for the metadata for the referred entity or subject. The\rdf:Description[element that corresponds to the finding aid contains references to the entities and
subjects within the same RDF file to ensure the correct interpretation and
consistency of the metadata. The figure below shows a sample RDF file generated
for a specific finding aid (see Fig. 2).
RDF records created from finding aids can be regularly harvested by the IMR using
HTTP requests and responses. Once harvested, the RDF files are processed to extract
metadata and merge them into the XML topic map, stored according to the XTM
specification. The use of XTM for data serialization has demonstrated its usefulness to
improve access to information. Topic maps mediate between data repositories and end
users, and in the same way, a subject heading list is used in libraries to mediate between
the bibliographic catalog and the library users. Topic maps were created in the 1990’s
to enable the exchange of the indexes added to technical books and user guides. This
idea was later applied to the collections of digital documents. The topic map
specification was published as an international standard ISO 13250 in 1999 within the
Document Description and Processing Languages group and updated in 2003.
Initially, topic maps were based on SGML (Standard Generalized Markup Language),
but these schemas were later migrated to XML with the creation of XTM (XML topic
maps). The international standard ISO 13250 is made up of different parts: part 2
provides the data model, part 3 the XML syntax.
Topic maps use the term subject to refer to anything whatsoever, regardless of
whether it exists or has any other specific characteristics, about which anything
whatsoever may be asserted by any means whatsoever. Subjects are uniquely
identified by subject identifiers and subject locator. Subject identifiers refer to the
subjects of the statements, and subject locators correspond to specific information
resources. The latter will be used when the information is provided for an
information resource and not for the entity represented by it. For example, if we use
the URL http://www.uc3m.es as a subject locator, the information for this topic in
Arch Sci
123
the topic map shall refer to the Web site of the institution; if we use the same URL
as a subject identifier, the information shall refer to the institution represented by
this URL. Topics refer to the representation of the subjects in the topic map: symbols
used within a topic map to represent one, and only one, subject, in order to allow
statements to be made about the subject. The difference between subjects and topics
is similar to the difference between concepts and designations where a concept is a
unit that encompasses the common characteristics of some objects, and the desig-
nation the word or words used to refer to that concept in a specific language.
In practice, topic maps consist of indexes created for a set of resources. The topic
map is made of (a) a set of topics that index the content of the resources; (b) the
relationships between topics, and (c) the information resources indexed by those
topics. The relationship between one resource and one topic means that that
resource provides information about that topic. Advantages of topic maps as
indexing and retrieval tools go beyond those provided by thesauri or subject
headings, as in topic maps it is possible to categorize the relationships between
topics. The category of the relationships is not restricted in advance, as with subject
headings or thesauri, and indexers can add different types of relationships between
topics. Having no constraints on the types of relationships between topics,
information professionals may summarize with a greater precision the knowledge
embedded in the documents that make up the collection. Topic maps also allow the
creation of relationships between topics, even if these relationships are not explicitly
or implicitly defined in the indexed resources.
Fig. 2 Sample RDF record generated for harvesting
Arch Sci
123
Topic map characteristics are also summarized in three concepts: names,
occurrences, and roles. Names refer to topics using a string of characters, and
several names can be assigned to the same topic, solving the issues related to
synonyms, preferred, and non-preferred terms or names in different languages.
Occurrences are the documents or information resources providing information
about specific topics. Roles represent the role of the topics linked by an association.
The associations can be reified to become the topic of subsequent declaration, and
the topic map’s model can be easily complemented with ontologies to identify valid
sets of names, associations, and roles.
Another interesting feature of topic maps is the scope of topics and associations.
The scope represents the context in which the information provided in the topic map
is valid. The implementation of the topic maps described in this paper makes use of
the scope to indicate in which information resources the relationships between
persons, entities, and archives have been documented.
The use of topic maps to facilitate retrieval of different types of data has been
analyzed by different authors (Yi 2008; Tramullas and Garrido 2006), but this
technology has not been widely adopted by archives. Most references describing the
use of topic maps refer to initiatives to give unified access to heterogeneous
collections of materials and metadata (Schweiger et al. 2003; Venkatesh et al. 2007;
Shien-Chiang 2008). As a summary, the advantages of using topic maps as an
indexing tool are similar to those provided by controlled vocabularies like thesauri
and list of subject headings so include the following:
• The use of descriptors to represent the content of the data ıtems helps improve
the relevance of the retrieval process with respect to the results that may be
obtained using free text searching.
• The relationships between topics allow users to explore the conceptual network
built by the data ıtems and locate other topics that may be relevant to their
search and added as new search criteria.
• Information professionals can describe the content of the information resources
with greater precision.
• New topics and relationships can be created, even if those topics and
relationships are not explicitly stated in the information resource being indexed.
In the case of the topic maps, there is no restriction in the type of relationships
that may be created between topics; any relationships can be made explicit, and
they are not limited to the classical relationships used in controlled vocabularies
(equivalent, broader, or narrower terms).
In this project, the topic map is generated from the RDF files. It provides separate
topics for the persons, companies, institutions, places, engineering works, thematic
subjects, archives holding, and description units identified in the RDF files. It also
specifies the relationships that can be inferred between topics. References to the
indexed documents are treated as occurrences that provide information about these
entities, following the topic map philosophy. The topic map highlights the
relationships between the different entities involved in the creation and custody of
the indexed materials. Both the entities and the relationships are categorized as
requested in the topic maps specification.
Arch Sci
123
In addition to the information about entities and occurrences, the topic map also
records the different relationships between entities and the context in which these
relationships are identified. For example, if an engineer has worked for a specific
company, this relationship is included in the topic map by means of a specific
\association[ element. This element is used to record the relationship between
entities, as well as the type of the relationship. The context of this relationship
corresponds to the finding aids where this relationship is documented, and it is
stored in the \scope[ element of the topic map. The \scope[ element is used to
indicate ‘‘in which context’’ a relationship between entities exists. In this project, the
\scope[ element always includes a reference to the finding aid or descriptive
record where the access points co-occur. Where different finding aids provide
evidence of the same relationship between a pair of entities, the\scope[element is
repeated as many times as needed. This use of the\scope[element is aligned with
its initial purpose in the XTM specification (Fig. 3).
The processing of the RDF files to generate and update the contents of the XTM
file is done by means of a Visual Basic program. This software completes different
Fig. 3 Fragment of the XTM file
Arch Sci
123
steps: it checks whether the entity being processed already exists in the XTM file
by means of its URI, it checks whether the relationship with the other topic
already exists, etc. These controls ensure the integrity of the data in the topic map
and avoid the creation of duplicated entries and relationships. The resulting XTM
file is the actual metadata registry, a single, big XML database containing all the
data needed to retrieve and access the full set of distributed finding aids and
record representations. It is stored in a native XML database, Berkeley DB XML,
that provides users with full-text and qualified searching facilities. This is
combined with an exploratory approach based on browsing the relationships
between topics.
The process and tools used to generate the topic map is described in Fig. 4.
Regarding the tools supporting this process, archivists creating descriptions use
an XML editor to create the finding aids and descriptive records and link them to
remote EAC-CPF files. A plug-in developed in Visual Basic is provided to generate
a RDF file for each finding aid by applying an XSLT stylesheet. These RDF files are
Fig. 4 Processing of source fileto generate the XTM file
Arch Sci
123
made available in a specific folder where they can be harvested by the IMR
component. IMR is the server component in charge of harvesting and aggregating
the metadata. IMR includes (a) a harvesting component developed as a Windows
service in Visual Basic, (b) a Visual Basic program that processes the harvested files
and merges their data into the existing topic map, and (c) a third component also
written in Visual Basic that applies two XSLT transformations to generate the
HTML pages that compose the user interface for browsing the aggregated data set.
IMR administrators can execute the publication generation process to create the
navigation layer and generate the Web-based interface. The creation of the HTML
pages requires an intermediate step to generate separate XML files for each topic
and its relationships. The following file is an example of these intermediate XML
files (Fig. 5).
These intermediate files are converted to HTML by means of a second XSLT
stylesheet. The resulting files contain all the hypertext links to enable users navigate
across the topic map. The picture below shows an example of an HTML page
generated for one topic corresponding to an engineer. The menu at the right-hand
Fig. 5 Intermediate XML file generated for XTM publication
Arch Sci
123
side of the page shows the different types of links (relationship types) leading to a
page with the other entities and subjects (topics) related to this topic (Fig. 6).
Benefits of the proposed architecture
The benefits of the proposed architecture include (a) independence of the archives,
(b) scalability and availability, (c) sharing context and authority records and
thesauri, and (d) economy.
Independence of the archives
Some initiatives for setting up a collaboration environment rely on the creation of a
central database where finding aids and descriptive metadata records are stored.
Archivists are offered some kind of Web-based interface to complete the finding aid
using forms, and the data are directly saved into the remote database. In these cases,
archives depend on the availability of the central database and may suffer delays
and problems related to limits in the number of concurrent users. In addition, the
opportunity to publish and share descriptions in other scenarios is restricted, as the
data are owned by the central repository. The LWE-IMR approach gives archivists
the choice of keeping their own repository and Web publication customized
according to their corporate images, fully integrated into their institutional Web
sites. In addition, risks related to the unavailability of the central database are
minimized.
Fig. 6 User interface for browsing aggregated metadata
Arch Sci
123
Scalability and availability
The aggregation of metadata subsets is more reliable than other technical choices
based on the distribution of searches between repositories and the later consoli-
dation of search results into a single list. The latter is the approach followed by
Z39.50 implementations and metasearch engines. In scenarios based on the
distribution of searches, the performance of the whole system may be negatively
impacted by the performance of the worst subsystem. If one of the nodes produces a
delay in providing results to the metasearch engine, the search process as a whole is
delayed. Due to that reason, batch data aggregation approaches may be considered
better than metasearch, especially in the context of data with low volatility like
archival descriptions.
Sharing context and authority records and thesauri
The sharing of EAC-CPF and SKOS records is accomplished at the LWE by means
of the implementation of the SRU profiles. Most of the open-source software
applications available today for managing finding aids give archivists the possibility
of managing local list of descriptors and authority records. This approach is not
aligned with the current trend of open data that promotes making data available on
the Web for the community of users. With LWE, archivists can search remote
repositories of authority records and thesauri when creating the finding aids, and
EAC-CPF and SKOS records can be easily reused. This leverages the investments
and efforts made in the development of authority records and thesauri, as more and
more centers and projects can access and reuse them.
Finally, and this is the great benefit of the IMR, as the metadata in the finding
aids are linked to authority records and thesauri identified by global URIs, it is
possible to set up a global information discovery system based on the inferred
relationships between entities and concepts.
Economy
Archives wanting to participate in the proposed infrastructure just need to have
Internet access, an XML editor and some utilities, and plug-ins to support the
generation of RDF data and to search remote EAC-CPF and SKOS repositories with
SRU. Exorbitant costs for software or hardware are not required. The technical
infrastructure is easy to maintain and does not require significant investment.
Conclusions
The proposed architecture demonstrates the feasibility of a distributed collaboration
environment based on existing descriptive and semantic metadata standards and
technical protocols. The architecture includes a LWE where archivists can create
finding aids or other descriptive metadata records and assign access points taken
from remote EAC-CPF and SKOS repositories. The SRU protocol is proposed to
Arch Sci
123
interact with these remote repositories. The LWE must be deployed at each archive.
That is to say, there will be as many instances of the LWE as archives participate in
the initiative. On the other hand, the IMR component harvests and aggregates a
subset of metadata and processes them to infer relationships between access points.
The access layer exploits these semantically enriched relationships to improve the
end-user experience when browsing the aggregated metadata.
The implementation of the proposed architecture demonstrates its benefits for
archivists and end users. It makes collaboration between archives easier and
improves traditional methods used to search and access big collections of finding
aids. Users are given additional paths to explore the published items.
The architecture also demonstrates that EAD is a living standard offering great
opportunities to archivists and information professionals. EAD can be used not only
to encode archival finding aids, but descriptions for other types of materials
(photographs, art works, etc.), and can be combined with other metadata. Relational
technologies do not seem to be capable of dealing with the complexity and nesting
levels of multilevel descriptions, and these constraints become even more evident
when it is necessary to manage—in a single repository—descriptive metadata
records based on different schemas: TEI, Dublin Core, MARCXML, METS, etc. In
these cases, the use of native XML is a solution that is technically feasible. The
publishing features incorporated into the architecture are also one relevant factor,
especially considering the conclusions of Yaco (2008), who identified a lack of
experience and knowledge about server technologies (necessary in order to publish
EAD finding aids on the Web) as one of the main obstacles for the adoption of this
standard.
LWE-IMR also proposes a solution to the integration of EAD records with
authority records. This has been a recurrent problem in most of the EAD
implementations identified in the literature. The links between descriptive metadata
and authority records have usually been implemented by means of proprietary
solutions that cannot really be reused between projects. The same situation arises
when considering the links between finding aids and terms from controlled
vocabularies. The proposed implementation of remote access through SRU solves
this issue.
Another hypothesis that is validated with the proposed architecture is the
possibility of applying the standards developed by the Semantic Web community
(RDF, SKOS) in archival practice. Up until now, these standards have had a minor
impact on the business practices of archivists, who consider the Semantic Web
standards as something out of the scope of their work. The literature includes a few
references to the use of Semantic Web standards in archives (Palacios Escalona
2006; Sanchez-Alonso et al. 2008) and popular open-source applications do not
support them (with the exception of ICA-AtoM, that can import and export SKOS
encoded data). The implementation of the IMR demonstrates to what extent these
standards can be useful to design global archival information systems to discover
independently managed data, and to identify the relationships that exist between
scattered fonds and collections that may be related by provenance, subject or by any
other aspect. The IMR helps discover relationships that—in any other way—would
remain hidden to end users and researchers. The combined use of topic maps, XTM,
Arch Sci
123
RDF, SKOS, and EAC-CPF sets an effective scenario to build a semantic rich
access layer for distributed records. The proposed usage of XTM demonstrates the
potential of this specification for metadata discovery. XTM goes beyond traditional
indexes, and the combination of full-text indexing with semantic-based browsing
capabilities offers an appropriate solution for navigating large information spaces.
EAC-CPF and SKOS solve in turn two of the main problems related to the use of
XTM: the need to manage the identity of the topics and the control of the
vocabulary. Topic maps need to standardize the names assigned to topics and
relationships. In this approach, identifiers for topics are taken from the URIs of
EAC-CPF and SKOS records. The possibilities that SKOS offers to establish
equivalency between concepts in different thesauri could also be used to enrich the
semantic access layer. As a conclusion, the collaboration framework demonstrates
that support to these standards should not be viewed as additional features to current
information systems, but as the core features on which future archival information
systems should be based.
The LWE-IMR has proven to be a valid approach to ensure context-based
metadata aggregation and discovery in a network of distributed information centers.
Although the activity was initially planned for archives managing metadata encoded
in EAD and EAC-CPF, the technical infrastructure is fully compatible with other
metadata schemas and descriptive standards.
References
Bak G, Pam A (2008) Points of convergence: seamless long-term access to digital publications and
archival records at library and archives Canada. Arch Sci 8:279–293
Bountouri L, Gergatsoulis M (2009) Interoperability between archival and bibliographic metadata: an
EAD to MODS crosswalk. J Libr Metadata 9(1):98–133
Clavaud F, Sevigny M (2005) Controlling the production of EAD encoded documents, extracting
metadata and publishing them on the web: methods and tools France. J Arch Org 3(2/3):147–169
Cornish A (2004) Using a native XML database for encoded archival description search and retrieval. Inf
Technol Libr 23(4):181
Gilliland-Swetland AJ (1998) Evaluation design for large-scale, collaborative online archives: interim
report of the online archive of California evaluation project. Arch Mus Informatics 12(3/4):177–203
Hill A, Stockting B, Higgins S (2005) Different strokes for different folks: presenting EAD in three UK
online catalogues. J Arch Org 3(2/3):183–206
Huvila I (2008) Participatory archive: towards decentralised curation, radical user orientation, and
broader contextualisation of records management. Arch Sci 8:15–36
Imhof A (2008) Using International Standards to develop a union catalogue for archives in Germany:
Aspects to consider regarding interoperability between libraries and archives. D-Lib Mag 14 (9/10).
http://www.dlib.org/dlib/september08/imhof/09imhof.html. Accessed 7 Jan 2014
Kim H (2003) Myongji University digital library project: implementing a KORMARC/EAD integrated
system. TEL 21(4):367–374
Palacios Escalona JP (2006). Modelo de unificacion semantica de ontologıas, aplicado al dominio de los
archivos digitales. (Doctoral dissertation, Universidad Politecnica de Madrid). http://dialnet.unirioja.
es/servlet/tesis?codigo=2674. Accessed 7 Jan 2014
Pitti DV (2004) Creator description: encoded archival context. Cataloging Classif Q 38(3):201–226
Pitti DV (2006) Technology and the transformation of archival description. J Arch Org 3(2):9–22
Sanchez-Alonso S, Sicilia MA, Rato G (2008) Sobre la interoperabilidad semantica en las descripciones
archivısticas digitales. Revista Esp Doc Cient 31(1):11–38
Schweiger R et al (2003) Linking clinical data using XML topic maps. Artif Intel Med 28:105
Arch Sci
123
Shien-Chiang Y (2008) Discussion on web archives using topic maps. JoEMLS 46(1):55–80
Sigler L (2009) The changing world of archives. PNLA Q 73(4):36–44
Szary RV (2006) Encoded archival context (EAC) and archival description: rationale and background.
J Arch Org 3(2):217–227
Thurman A (2005) Metadata standards for archival control: an introduction to EAC and EAC. Cat Classif
Q 40(3):183–212
Tramullas J, Garrido P (2006) Constructing Web subject gateways using Dublin Core, the Resource
Description Framework and Topic Maps. Inf Res 11(2). http://InformationR.net/ir/11-2/paper248.
html. Accessed 7 Jan 2014
Venkatesh V et al (2007) Topic maps: adopting user-centred indexing technologies in course management
systems. J Interact Learning Res 18:429–450
Yaco S (2008) It’s complicated: barriers to EAD implementation. Am Arch 71(Fall/Winter 2008):
456–475
Yakel E, Kim J (2005) Adoption and diffusion of encoded archival description. J Am Soc Inf Sci Technol
56(13):1427–1437
Yi M (2008) Information organization and retrieval using a topic maps-based ontology: results of a task-
based evaluation. J Am Soc Inf Sci Technol 59(12):1898–1911
Dr. Ricardo Eito-Brun is an Associate Professor at Universidad Carlos III de Madrid, Spain, where he
teaches different subjects related to digital publishing, knowledge organization and representation,
classification, and information management. Ricardo holds a master degree in Informatics from
Universidad Carlos III de Madrid and in Documentation and Information Science from the University of
Granada (Spain) and a doctoral degree from the University of Zaragoza (Spain) on the application of
distributed collaboration environments and Semantic Web techniques for the description and classifi-
cation of archival materials. His research interest is in information management practices, and he has been
responsible for several large-scale content management and Web-based publishing projects for companies
and public institutions in European countries. He is the author of four books on markup languages and
XML and numerous articles and conference papers in the field of information management.
Arch Sci
123