the vra core application profile for searching and ...guus/papers/assem10a.pdf · in this paper we...

19
Proc. Int’l Conf. on Dublin Core and Metadata Applications 2010 The VRA Core Application Profile for Searching and Presenting Cultural Heritage: the MultimediaN Case Mark van Assem VU University, Amsterdam [email protected] Jacco van Ossenbruggen VU University / CWI Amsterdam [email protected] Guus Schreiber VU University, Amsterdam [email protected] Abstract In this paper we describe the usage of the VRA metadata schema in converting, searching through, presenting and navigating cultural heritage collections in RDF. We use the MultimediaN demonstrator as illustration. Our research question is how to represent heterogeneous metadata collections in RDF for usage in generic semantic search, navigation and presentation features, while preserving the original data as closely as possible. We investigate possible approaches to conversion of the data that will fulfill the application’s needs, and found that a representation of three schema levels is most useful. We call these the collection-specific, domain-specific and generic levels. The domain-specific level abstracts from individual collections, but is more specific than domain-independent, generic schemas such as Dublin Core. An existing specification called VRA Core 3.0 suited our need for a domain-specific schema, so we developed an Application Profile for it, focusing mainly on the Description Set Profile. In the process we developed an RDF schema for VRA Core and used OWL Restrictions to indicate expected value ranges. Keywords: metadata conversion; metadata schema; semantic search; cultural heritage; visual art 1. Introduction The MultimediaN E-Culture project started in 2005 with the goal to create a semantic search engine that encompasses metadata of several large (mainly Dutch) cultural heritage institutions, such as Rijksmuseum Amsterdam and the Royal Tropical Institute. The data was converted to RDF, and a demonstrator was built that included faceted browsing, keyword disambiguation, autocompletion, time-based search, relation discovery and automatic clustering of search results. The demonstrator won the award for best Semantic Web application at the International Semantic Web Conference (Schreiber et al., 2006); it is accessible at http://e-culture.multimedian.nl/ . We discuss some options for conversion of the raw data into one or more metadata schemas. The approach we chose results in three levels of metadata: collection-specific, domain-specific and generic. We show how these levels all have their function in the search and presentation facilities of the demonstrator. The infrastructure required for this project turns out to be generic for semantic search. The demonstrator’s backend is called ClioPatria, and is based on SWI-Prolog’s Semantic Web and Server libraries. It is now in use by several other projects that require semantic search and presentation facilities, including the Poseidon project (piracy attacks), the DBtune project (music- related mashups) and the CHIP project (art recommender system). ClioPatria is able to deal with several levels of metadata by providing a scalable indexing of rdfs:subPropertyOf relationships. ClioPatria also forms the basis for the Europeana project’s Thought Lab. The system received an honorable mention in the In-Use track of the 2008 International Semantic Web Conference (Wielemaker et al., 2008).

Upload: hadan

Post on 30-Aug-2018

214 views

Category:

Documents


0 download

TRANSCRIPT

Proc. Int’l Conf. on Dublin Core and Metadata Applications 2010

The VRA Core Application Profile for Searching and Presenting Cultural Heritage: the MultimediaN Case

Mark van Assem

VU University, Amsterdam [email protected]

Jacco van Ossenbruggen VU University / CWI

Amsterdam [email protected]

Guus Schreiber VU University, Amsterdam

[email protected]

Abstract In this paper we describe the usage of the VRA metadata schema in converting, searching through, presenting and navigating cultural heritage collections in RDF. We use the MultimediaN demonstrator as illustration. Our research question is how to represent heterogeneous metadata collections in RDF for usage in generic semantic search, navigation and presentation features, while preserving the original data as closely as possible. We investigate possible approaches to conversion of the data that will fulfill the application’s needs, and found that a representation of three schema levels is most useful. We call these the collection-specific, domain-specific and generic levels. The domain-specific level abstracts from individual collections, but is more specific than domain-independent, generic schemas such as Dublin Core. An existing specification called VRA Core 3.0 suited our need for a domain-specific schema, so we developed an Application Profile for it, focusing mainly on the Description Set Profile. In the process we developed an RDF schema for VRA Core and used OWL Restrictions to indicate expected value ranges. Keywords: metadata conversion; metadata schema; semantic search; cultural heritage; visual art

1. Introduction The MultimediaN E-Culture project started in 2005 with the goal to create a semantic search

engine that encompasses metadata of several large (mainly Dutch) cultural heritage institutions, such as Rijksmuseum Amsterdam and the Royal Tropical Institute. The data was converted to RDF, and a demonstrator was built that included faceted browsing, keyword disambiguation, autocompletion, time-based search, relation discovery and automatic clustering of search results. The demonstrator won the award for best Semantic Web application at the International Semantic Web Conference (Schreiber et al., 2006); it is accessible at http://e-culture.multimedian.nl/.

We discuss some options for conversion of the raw data into one or more metadata schemas. The approach we chose results in three levels of metadata: collection-specific, domain-specific and generic. We show how these levels all have their function in the search and presentation facilities of the demonstrator.

The infrastructure required for this project turns out to be generic for semantic search. The demonstrator’s backend is called ClioPatria, and is based on SWI-Prolog’s Semantic Web and Server libraries. It is now in use by several other projects that require semantic search and presentation facilities, including the Poseidon project (piracy attacks), the DBtune project (music-related mashups) and the CHIP project (art recommender system). ClioPatria is able to deal with several levels of metadata by providing a scalable indexing of rdfs:subPropertyOf relationships. ClioPatria also forms the basis for the Europeana project’s Thought Lab. The system received an honorable mention in the In-Use track of the 2008 International Semantic Web Conference (Wielemaker et al., 2008).

2010 Proc. Int’l Conf. on Dublin Core and Metadata Applications

2. Metadata in the Demonstrator

FIGURE 1: Metadata and vocabularies in the demonstrator. Solid circles are metadata records, empty circles are vocabularies. One-sided arrows denote usage (e.g. RKD metadata uses IconClass), two-sided arrows denote that

equivalence relations exist between two vocabularies. The numbers represent #records or #concepts. Before we go into options for metadata conversion with schemas, we give an impression of both the raw data and the end result in our project. Figure 1 illustrates that we integrated several collections by linking them through both the institution’s own vocabularies as well as external ones. The raw data records originated from six sources in different domains:

• Rijksmuseum (paintings and art objects); • Bibliopolis (book manufacturing); • Royal Tropical Institute museum (ethnographic; abbreviated “Tropenmuseum”); • Rijksmuseum Volkenkunde Leiden (ethnographic; abbreviated “RMV”); • Louvre museum Paris (paintings and art objects); • Artchive.org (mainly paintings); and • Netherlands Institute for Art History (paintings, drawings, photographs; abbreviated

“RKD”). In total more than 262,000 records were converted (a portion of records was not converted because they lacked an image or useful metadata). The raw records have the following features:

• There is one record per work that contains metadata on both the work and its digital image. No separation is made into e.g. Work, Expression, Manifestation or Item as in the FRBR model (IFLA 2008);

• Elements found in the majority of collections (but not all): title, creator, date, type of object, description, material, dimensions, location. For example, the ethnology collections tend to use collector instead of creator because the latter is often unknown;

• Specialized elements that appear in only one or two collections: elements for creator roles (e.g. photographer, engraver), location of work within the building (floor, wing), subject matter (subject.location, subject.iconography), time (before, after, date.earliest, time

Proc. Int’l Conf. on Dublin Core and Metadata Applications 2010

period such as 20th century), series work belongs to, expositions work was used in, text on the object (title on object, title in documentation).

• Metadata elements that refer to images are: copyrightholder, source, type (e.g. black/white photo)

The original syntactical representation of the collections is either XML or a database format (in the latter case an XML export of the database was made). After analyzing the data, the project was confronted with the task of converting to RDF. To illustrate the possible choices in the next sections, we use the fictitious record in Figure 2 (based on an Artchive record) as an example.

Creator: Rembrandt van Rijn Title: Jeremiah lamenting the destruction of Jerusalem Year: 1630 Type: Oil on panel Type2: jpeg Size: 58.3 x 46.6 cm Location: Rijksmuseum, Amsterdam Photographer: John Johnson

Figure 2: Fictitious metadata record. Modified version of an existing Artchive record.

3. Conversion of Metadata using Schema Levels There are two basic options when converting metadata: either develop a schema for each individual collection, or choose a more generic schema to which all collections are converted directly. The first option has as benefit that all information of the collection is retained, but the drawback that it remains unclear which elements are shared over collections (one collection might use author, another creator, a third painter). Applications need to be aware of the overlap in meaning to perform search and presentation. For example, a facet browser needs to show users one creator facet, containing the creators from all collections. rijks:creator Rembrandt van Rijn dct:creator Rembrandt van Rijn rijks:title Jeremiah ... dct:title Jeremiah ... rijks:year 1630 dct:date 1630 rijks:period Baroque dct:temporal Baroque rijks:type Oil on panel dct:medium Oil on panel rijks:type2 jpeg dct:type jpeg rijks:size 58.3 x 46.6 cm dct:extent 58.3 x 46.6 cm rijks:location Rijksmuseum, dct:spatial Rijksmuseum, Amsterdam

Amsterdam rijks:photographer John Johnson dc:contributor John Johnson

Figure 3: Two options for representing metadata records: collection-specific schema (left) and generic schema (right)

For this reason the second option seems more suitable. Projects like the Memory of the Netherlands (over seventy-five Dutch collections) use this approach, with Dublin Core as the shared schema (see Figure 3, right side). The drawback is that the metadata becomes too abstract to provide domain-specific functionality expected by the user. For example, our demonstrator displays search results using a graphical timeline, which includes styles and periods that co-occured with the production of the works. It uses date.creation of works, and startdate/enddate of styles/periods to generate the timeline. The DC framework lacks the granularity needed to generate a timeline: it does not separate works from styles/periods, and dct:temporal does not make a distinction between start/enddates, creation dates and other temporal information.

We use a hybrid approach instead: (1) represent collections with their own metadata schema; (2) map collection-specific schema to a domain-specific schema; and (3) map domain-specific schema to generic schema. In our project a domain-specific schema called VRA Core was used, other schemas such as CIDOC/CRM can also be used. Some example mappings that realize this are shown in Figure 4.

2010 Proc. Int’l Conf. on Dublin Core and Metadata Applications

rijks:period rdfs:subPropertyOf vra:stylePeriod . rijks:type rdfs:subPropertyOf vra:material . ... vra:stylePeriod rdfs:subPropertyOf dct:temporal . vra:material rdfs:subPropertyOf dct:medium . ...

Figure 4: Example mappings linking three levels of metadata.

We argue that the domain-specific level is crucial. It strikes a balance between providing a homogeneous view on the data, and providing information that is specific to the domain. The collection-specific level has numerous schemas, and does not allow a homogeneous view on the collections. Creating generic search and presentation features on top of them is difficult. The generic (DC) level provides a homogeneous view on the collections, but search and presentation cannot be tailored to the user’s needs (e.g. timeline). In our hybrid approach, the application works mostly on the domain-level, so that the role of the generic schema (DC) is mostly to provide interoperability with other systems.

4. The VRA Core 3.0 Application Profile In this Section we describe the VRA Core 3.0 AP, which we used as our domain-specific schema.

4.1 Introduction to VRA Core 3.0 VRA Core 3.0 is a standard defined by the Visual Resource Association consisting of over 600

active members (VRA, 2010). It defines seventeen main elements and thirty-seven qualifiers for those elements. A few of the elements are similar to DC elements, e.g. vra:subject, most have a more specific meaning, e.g. vra:culture. VRA Core is explicitly defined without an accompanying syntactical format.

VRA element/qualifier Range Dublin Core Example/Meaning Record Type {work,image} Type type of record Type AAT Type “print”, “sculpture”, “digital” Title.Translation text Title “As Sucedi” Measurements.Dimensions text Format “24.5 x 35 cm” Measurements.Format text Format “jpeg” Material.Medium AAT Format “ink” Material.Support AAT Format “paper” Technique AAT Format “etching” Creator ULAN, AAAF Creator, Contributor “Francisco de Goya” Creator.Role Controlled list Creator, Contributor “sculptor” Date.Creation text Date, Coverage “1985”, “5th century”, “ca. 1890 Date.Alteration text Date, Coverage Date.Restoration text Date, Coverage Location.CurrentSite BHA, AAAF Contributor, Coverage “Madrid (ESP)” Location.FormerSite BHA, AAAF Contributor, Coverage Location.CreationSite BHA, AAAF Contributor, Coverage Location.CurrentRepository BHA, AAAF Contributor, Coverage “Ann Arbor, Un. of Michigan Museum of Art” IDNumber.FormerRepository Identifier IDNumber.CurrentAccession Identifier Style/Period.Style AAT Coverage, Subject “Impressionist” Style/Period.Group AAT Coverage, Subject “Fauve” Style/Period.School AAT Coverage, Subject “School of Rembrandt” Culture AAT, LCSH Coverage “Dutch” Subject AAT, TGM Subject What the object depicts or expresses Relation Relation Relationships with other works Description text Description Source Source Reference to source of annotation Rights Rights Who owns rights of the object/image

Table 1: Selection of elements from the VRA Core Categories specification. Range and mapping to Dublin Core are those suggested by the specification. Most of the examples also come from the specification.

VRA Core focuses on the domain of visual art, dividing resources into Works (e.g. Rembrandt’s Nightwatch) and Images (photographs, X-rays, …). Table 1 shows a selection of

Proc. Int’l Conf. on Dublin Core and Metadata Applications 2010

elements. In the future we plan to investigate the usage of the newer VRA Core 4.0, which was not available in 2005 when we started our work.

4.2 Background and choice for VRA Core 3.0 Our decision for VRA Core over other options such as Dublin Core, FRBR and CIDOC/CRM

has the following reasons:

• VRA Core elements map nicely to elements found in the raw data; • The link between VRA Core and Dublin Core is straightforward; this made it possible to

standardize our infrastructure (ClioPatria) on Dublin Core-like schemas; • Frameworks such as CIDOC/CRM are more complex than our metadata demands, e.g.

because it uses events as a central modeling concept. The element annotation publisher=John Smith requires five entities and four properties to be represented in CIDOC/CRM (Kakali et al., 2007);

• The collections make no fine-grained distinctions such as Work, Expression, Manifestation, Item in FRBR model. All metadata is attached to one record. For our system only the distinction between the intellectual content (Work, Expression) and the physical carrier (Manifestation, Item) is important;

• Such a distinction allows enrichments; we later added e.g. resolution and annotations of the pictorial content of regions of the image (e.g. man, city, crying, fire);

• It is possible to automatically decide which element-value pairs in the original records should be attached to either a Work or an Image; in our collections these are only a few properties (source, type, copyrightholder);

• It might be difficult to automatically select which metadata belongs to Manifestations and which to Items.

4.2 Creating a Description Set Profile for VRA Core An AP is a specification that defines a specialization of Dublin Core (Nilsson et al., 2008). It

contains Functional requirements, Domain Model and Description Set Profile (DSP). We will focus here on working out the DSP. A DSP is the formal description of the metadata schema (Nilsson, 2008). It defines “templates” for describing resources. Templates consist of:

• list of the elements and relationship to other elements; • types of resources the metadata describes in this context; • how many times an element may occur for one resource; • what values are allowed for an element.

We have created an RDF schema for the elements by turning each element into a property and

each qualifier into a subproperty, see http://thesauri.cs.vu.nl/vra/vracore3.rdfs. The types of resources the metadata describes are: Work and Image. We created two classes of those names in the schema. We added a property vra:depictedBy between vra:Work and vra:Image, because VRA does not specify this link. The specification further states that each element may be applied as many times as necessary, to both Works and Images. To realize this in the schema we made a superclass vra:VisualResource. The properties have this superclass as domain. Lastly, the AP should describe allowed values, e.g. AAT for technique is recommended by VRA. We discuss allowed values in the next section.

The VRA specification indicates how VRA Core specializes Dublin Core through “mappings” to DC. We have represented these as rdfs:subPropertyOf statements. However, the recommendations are not without issues. Firstly, it does not use the newer, specialized properties in DC Terms. We can use DC Terms to create better mappings (see http://thesauri.cs.vu.nl/vra/vracore3-dc-additional.rdfs):

2010 Proc. Int’l Conf. on Dublin Core and Metadata Applications

• vra:material to dct:medium (“The material or physical carrier of the resource”); • vra:measurements.dimensions and vra:measurements.resolutions to dct:extent (“The size

or duration of the resource”); • vra:title.variant and vra:title.translation to dct:alternative (“An alternative name for the

resource”); • vra:largerEntity and vra:series to dct:isPartOf, (“A related resource in which the

described resource is physically or logically included”); • vra:date.creation to dct:created (“Date of creation of the resource”); • vra:date.alteration to dct:modified (“Date on which the resource was changed”).

Secondly, our interpretation of the mapping specification results in some VRA elements being

mapped to multiple DC elements. We think only one DC element might be appropriate. For example, we see no reason to map vra:date to dct:coverage. The reason VRA included that mapping might be that some metadata may misuse that element to indicate the topic of a work is a date. This should be solved in data conversion, not in the schemas. Thirdly, the definition of dct:coverage makes it hard to decide about appropriate mappings. Is that element intended as a sub-element of dct:subject (i.e. pertaining only to spatial and temporal aspects of subjects), or is it solely intended for spatial/temporal aspects of the object itself? The latter case seems less likely, as there already exists dct:date for the temporal aspect. That interpretation would mean that a spatial property dct:location is missing. If then the former case is the right one, the mappings to dct:coverage should be removed for vra:location and vra:style/period.

Source (VRA) Target Possible solution date dc:date, dc:coverage delete mapping to coverage creator dc:creator, dc:contributor map new vra:contributor to dct:contributor style/period dc:coverage, dc:subject map to dct:temporal or dct:subject location dc:contributor, dc:coverage map to dct:spatial, or create new dct:location currentRepository, formerRepository

vra:location map to dct:publisher

series, largerEntity vra:title map to dct:isPartOf role vra:creator 3-aried relation, rule-based mapping to dct:creator

Table 2: Problematic VRA-DC mappings and possible solutions.

Fourthly, an element vra:contributor is missing to specialize DC. Fifthly, the mapping of

vra:currentRepository and vra:formerRepository to vra:location might not be correct, because the specification uses them to refer to organizations, not locations. A mapping to dct:publisher might be more appropriate. We summarized problems and possible solutions in Table 2.

4.3 Indicating Collection-specific Value Ranges An AP should specify allowed values for elements. This can be realized in the DSP by

specifying an rdfs:range on the properties. The ranges can be obtained from the recommendations in the VRA specification, e.g. the Art and Architecture Thesaurus (AAT) for vra:material. This is not an appropriate solution in heterogeneous domains. For example, the Dutch the Rijksmuseum uses AAT as the source for object type, while the National Ethnographic Museum uses SVCN (a vocabulary derived from AAT). A mechanism is needed that can encode the expected values per collection: collection-specific value ranges. One way to realize this is with OWL restrictions. Each collection then gets a subclass of Work/Image on which the restrictions are specified. As an experiment we defined these value ranges for several elements of the Artchive collection. We created a subclass ac:Work and applied OWL restrictions on that class to elements such as style/Period, material and creator (see http://eculture.cs.vu.nl/resources/annotationScheme.rdfs). These properties were restricted to

Proc. Int’l Conf. on Dublin Core and Metadata Applications 2010

values from aat:styles_and_periods, aat:materials, and ulan:person, respectively. The restrictions can be used in supporting manual annotation by users, which we realized in our annotation interface (see Hollink, 2006; p88). It can help annotators quickly find the right value. A similar solution for representing value ranges is proposed by Nilsson (2008), with the difference that there constraints on the data are encoded. This allows automated checking of the data, which is more difficult with OWL.

5. Semantic Search and Presentation with Layered Metadata We illustrate the potential of layered metadata by discussing search, navigation and

presentation features that benefit from such a representation.

Figure 4: Screenshot of three result clusters for the query “Rembrandt”. Hovering over the lower-left result shows the

path from the work to the matching literal, which uses the relationship “after drawing by”.

5.1 Result Clustering A search on a keyword such “Rembrandt” could result in a long list of matches. This leaves the

user without an overview of the nature of the results, and offers no way to intelligently navigate the results. The demonstrator attempts to alleviate this problem by clustering results and explain the reason for clustering (Schreiber et al., 2008). The system first finds all literals in the metadata that match, and then computes paths from these literals to instances of vra:Work. It then clusters the Works based on overlapping paths. A screenshot of the current demonstrator’s results is shown in Figure 4. Figure 5 shows some results and paths to illustrate clustering (works 3 and 4 are not real; included for illustration purposes only).

2010 Proc. Int’l Conf. on Dublin Core and Metadata Applications

ex:work1 –dc:creator-> ulan:Rembrandt –rdfs:label-> “Rembrandt” ex:work2 –dc:creator-> ulan:Rembrandt –rdfs:label-> “Rembrandt” ex:work3 –vra:currentRepository-> ex:rHouse –rdfs:label-> “Rembrandt House” ex:work4 –vra:currentRepository-> ex:rHouse –rdfs:label-> “Rembrandt House” ex:work5 -rijks:afterDrawingBy -> ulan:Rembrandt -rdfs:label-> "Rembrandt" ex:work6 -rijks:afterPaintingBy -> ulan:Rembrandt -rdfs:label-> "Rembrandt"

Figure 5: Some results and paths for the keyword search "Rembrandt" These paths can be generalized into the following clusters: works by Rembrandt (works 1 and 2), works in the Rembrandt House museum (3 and 4), and works based on a work by Rembrandt (5 and 6). Other types of clusters appear in other searches, e.g. “works by artists in the same style”, “works by students of X”, “works depicting same object X”. The names of the clusters are generated by a natural language pattern linked to a property or path (e.g. currentRepository to “in museum X”). Each of the three clusters from Figure 5 is formed based on elements from a different level: the first on the generic level (DC), the second on the domain-specific level (VRA), the third on the collection-specific level (Rijks). Actually, the current demonstrator generates one cluster for works 5 and 6 named “works related to someone”, because rijks:afterDrawingBy and rijks:afterPaintingBy only share vra:relation as superproperty. This can be improved easily by adding a new property mn:isBasedOn below vra:relation and above the two rijks properties. In effect we are adding domain knowledge currently missing in VRA.

A clustering application that is only aware of DC (i.e. has no knowledge of domain-specific schemas such as VRA) can still show some relevant clusters. The second and third clusters would appear as “spatially related” and “related works” clusters, based on “dumbing down” of the vra and rijks properties to dct:temporal and dct:relation.

5.2 Faceted Browsing Faceted browsing allows users to interactively constrain the results by selecting values from

facets. A known problem is that there are too many facets, because each property in the RDF data is a candidate (Hildebrand et al., 2006). VRA Core offers a natural set of “default” facets that is more informative than DC (e.g. style/period, culture), but abstracts from facets that are only important for one or two collections. Ideally the facet browser can adapt the current facet selection based on the data or the user’s current constraints. For example, the browser should automatically switch to a facet based on dct:format instead of more specific properties (e.g. vra:measurement and vra:type) if the records lack them. (Another reason to use the generic metadata level is that the results come from different resource types, e.g. videos or books, so that they share little domain-specific properties.)

The opposite should happen when the users “homes in” on a small subset of the data that contains collection-specific semantics. It becomes useful to show facets based on collection-specific elements such as photographer or wing, if they are available in the data. This functionality is partly implemented in the demonstrator.

5.3 Annotation tool In a separate experiment the project developed an annotation tool for experts of the Prints

collection of Rijksmuseum Amsterdam (Hildebrand et al., 2009). The experts wanted a simple interface with four slots: who, what, where, when. This could be realized easily by representing the four W’s as properties, and linking these properties to lower levels (e.g. dct:creator, dct:collaborator and subject.person to who). Already existing annotations (in VRA/DC) could be shown in the 4W slots (dumbing down is done automatically by ClioPatria), and new annotations could be added in these slots by the experts. Allowed values for each slot could be indicated with OWL Restrictions as explained in Section 4.3. This example shows that our approach (leveled metadata) is flexible; new levels can be added on the fly.

Proc. Int’l Conf. on Dublin Core and Metadata Applications 2010

Annotators were offered an autocompletion field for selecting vocabulary concepts. The interface shows background information on concepts, including images of works annotated with a concept (see Figure 6). This requires the vra:depictedBy property. Thus the annotation tool needed to be aware of several levels of metadata (VRA and 4W) at the same time.

Figure 6: Annotation field “What”, showing suggested concepts for query “siege”. Details are shown for the concept “siege” from IconClass, including two images annotated with that concept.

6. Discussion and Conclusions In this paper we investigated how to represent heterogeneous metadata collections in RDF for

usage in generic semantic search, navigation and presentation features, while preserving the original data’s meaning as closely as possible. We found that an approach using several levels of metadata (collection-specific, domain-specific, generic) strikes the best balance between these needs. It preserves the original metadata, and provides an integrated view of heterogeneous collections at the same time. A benefit of having several levels is that the application can choose the most relevant level, or even mix them. For example, interesting clusters can appear at all three levels in one search result. When required, the application may switch between levels (e.g. in facet browsing). The approach is flexible in that missing properties can easily be added (isBasedOn), as well as whole new levels (who, what, where, when).

Using a domain-specific level is particularly useful. For example, VRA provides a coherent set of initial facets for a facet browser. Its properties also allow generation of useful clusters such as museum a work is in, work of artists in the same style, works by students, works from the same culture, etcetera. Most of the time the search and presentation facilities work on the VRA level without the need for Dublin Core or collection-specific schema. In our approach the generic (DC) is mostly used to provide for data interoperability with other applications.

The role of a domain-specific schema can be fulfilled by creating one from scratch or reusing an existing one. We prefer the latter option as it provides “free” interoperability. VRA Core matches our collections’ metadata quite nicely, and is compatible with DC. Its separation of resources in Works and Images is workable for our collections. We are able to automatically separate element-value pairs of each raw record into RDF instances of Work and Image. This prerequisite to our approach might not work in all domains; this is future work.

We defined an Application Profile for VRA Core, focusing on specifying the Description Set Profile in RDF. The benefits of RDF are its simple (triple-based) model, and that its “dumb down” mechanism rdfs:subPropertyOf can be efficiently implemented. DSPs require that allowed element values are specified. We observed that it is harmful to specify these in a DSP when we

2010 Proc. Int’l Conf. on Dublin Core and Metadata Applications

are dealing with heterogeneous collections. Different collections use different vocabularies. We found a way to represent them on the collection level (instead of on the domain level) using OWL Restrictions.

During the development of the RDF schema we found that our interpretation of mappings between VRA Core and Dublin Core as subproperties causes some problems. Some of our mappings from VRA to DC seem to be wrong, and we were unsure how to deal with the dct:coverage element. In this case we feel these are minor issues which can be resolved through discussion with the VRA and DC communities.

Our approach is interesting for projects like Europeana. It currently only uses the generic (DC) level, which does not allow search and presentation features that users need. Switching to our approach will require several domain-specific schemas (possibly MARC, MPEG-7), as Europeana covers more domains than just visual art. It also requires more research into scalable implementations of RDF(S), as current infrastructure does not yet scale to the size of its collections.

Acknowledgements The authors acknowledge the work of the members of the MultimediaN E-Culture project in creating the VRA Core 3.0 schema, conversion of collections and developing the technological infrastructure for search, navigation and presentation functionality.

References Hildebrand, M., J. van Ossenbruggen, L. Hardman. (2006). /facet: A Browser for Heterogeneous Semantic Web

Repositories. In Proceedings of the 5th International Semantic Web Conference, 2006, pages 272-285. Hildebrand, M., J. van Ossenbruggen, L. Hardman, G. Jacobs (2009). Supporting subject matter annotation using

heterogeneous thesauri, a user study in Web data reuse. In International Journal of Human - Computer Studies, 2009, 887-902.

Hollink, L (2006). Semantic annotation for retrieval of visual resources. PhD Thesis, Vrije Universiteit Amsterdam. IFLA (2008) Functional Requirements for Bibliographic Records. Retrieved April 4, 2010, from

http://www.ifla.org/files/cataloguing/frbr/frbr_2008.pdf Kakali, Constantia, I. Lourdi, T. Stasinopoulou, L. Bountouri, C. Papatheodorou, M. Doerr and M. Gergatsoulis (2007)

Integrating Dublin Core metadata for cultural heritage collections using ontologies. Proceedings of the International Conference on Dublin Core and Metadata Applications, 2007, 128-139

Nilsson, M (2008). Description Set Profiles: A constraint language for Dublin Core Application Profiles. 2008. Retrieved April 4, 2010, from http://dublincore.org/documents/dc-dsp/.

Nilsson M., Baker, T., Johnston P. (2008) The Singapore Framework for Dublin Core Application Profiles. Retrieved March 10, 2010, from http://dublincore.org/documents/singapore-framework/.

Schreiber, G., A. Amin, M. van Assem, V. de Boer, L. Hardman, M. Hildebrand, L. Hollink, Z. Huang, J. van Kersen, M. de Niet, B. Omelayenko, J. van Ossenbruggen, R. Siebes, J. Taekema, J. Wielemaker, and B. Wielinga. (2006) MultimediaN E-Culture Demonstrator. Proceedings of the International Semantic Web Conference, 2006, 951-958.

Schreiber, G., A. Amin, L. Aroyo, M. van Assem, V. de Boer, L. Hardman, M. Hildebrand, B. Omelayenko, J. van Ossenbruggen, A. Tordai, J. Wielemaker, and B. Wielinga. (2008). Semantic annotation and search of cultural-heritage collections: The MultimediaN E-Culture demonstrator. J. Web Semantics, 2008, 6(4):243-249

Tordai,A., B. Omelayenko and G. Schreiber (2007). Semantic Excavation of the City of Books. In Proceedings of the Semantic Authoring, Annotation and Knowledge Markup Workshop (SAAKM2007), pages 39–46. CEUR-WS.

Visual Resource Association (2004). VRA Core 3.0. Retrieved April 4, 2010, from http://www.vraweb.org/resources/datastandards/vracore3/index.html

Visual Resource Association (2007). VRA Core 4.0 Introduction. Retrieved April 4, 2010, from http://www.vraweb.org/projects/vracore4/VRA_Core4_Intro.pdf

Visual Resource Association (2010). Brief History of the VRA. Retrieved April 4, 2010, from http://www.vraweb.org/about/index.html

Wielemaker, J., M. Hildebrand, J. van Ossenbruggen, and G. Schreiber (2008). Thesaurus-based search in large heterogeneous collections. Proceedings of the International Semantic Web Conference, 2008, 695-708.

Proc. Int’l Conf. on Dublin Core and Metadata Applications 2010

2010 Proc. Int’l Conf. on Dublin Core and Metadata Applications

Proc. Int’l Conf. on Dublin Core and Metadata Applications 2010

After our conversion process the sample record will look like this:

<http://www.artchive.com/artchive/r/rembrandt/jeremiah.jpg> a ac:Image ; rijks:photographer <ec:JohnJohnson> ; vra:type <ec:jpeg> . vra:resolution "..." ;

<http://t-d-b.org?http://www.artchive.com/artchive/r/rembrandt/jeremiah.jpg> a ac:Work ;

vra:creator <ulan:Rembrandt> ; vra:title "Jeremiah lamenting ..."@en-us ; vra:date 1630^xsd:year ; vra:period <aat:Baroque> ; vra:material.medium <aat:Oilpaint>; vra:material.support <aat:Panel> ; vra:dimensions: "58.3 x 46.6 cm" ; vra:currentRepository: <ec:Rijksmuseum> ; vra:depictedBy <http://www.artchive.com/artchive/r/rembrandt/jeremiah.jpg>.

Notice that in many cases the collection-specific property is almost identical in meaning to VRA properties, in which case we replaced them with the latter ones. We’ve also split the metadata properties over two entities, Work and Image. Then we enriched the data with vra:resolution, and separated the original property-value pair type=oil on panel into values oilpaint and panel for properties medium and support. Literals have been replaced with URIs from vocabularies such as the Getty Art an Architecture Thesaurus (AAT) and Universal List of Artist Names (ULAN) to produce the RDF graph in Figure A. We’ve also generated a URI for the Work that allows one to dereference it on the Web. The redirection service t-b-d.org makes it possible to return the image to which the Work refers. More details on conversion can be found in (Tordai, 2007).

3.5 Converting Literals in Original Records to URIs in RDF In each of the approaches above, one can choose to convert the metadata values in the original records to literal values, or to use URIs from a vocabulary. URIs are essential if the application should support semantic search. The E-Culture project attempts to replace literals with URIs using the basic recipes described below. See (Tordai et al., 2007) for more details.

• If the collection uses its own vocabulary, the vocabulary has to be converted to RDF first. Then the records’ metadata values have to be matched to the concept labels of the vocabulary to select the right URI. Often this is not trivial; indexers do not always use the exact labels used in the vocabulary, and sometimes add information which complicates interpretation;

• If the collection does not use its own vocabulary, a suitable one has to be selected. The matching problem is then usually more difficult than in the previous case;

• Vocabularies that have overlapping concepts should be mapped. In the ideal case there is a set of central vocabularies that forms the “bridge” between the collection-specific ones. In the MultimediaN case the Getty vocabularies AAT, ULAN and TGN play this role (see Figure A).

Some typical cases that complicate interpretation with an example:

• Concept not in vocabularies: “Arnold Boecklin”, “cave painting”, “Egyptian”, “Mali, inland delta of the Niger River”

• Misspellings: “Delacriox, Eugene” • Nicknames and descriptions: “Canaletto (Giovanni Antonio Canal)”, “Wu Chen (1280-

1354)”

2010 Proc. Int’l Conf. on Dublin Core and Metadata Applications

Special matching techniques can overcome some problems (e.g. spelling mistakes), in other cases manual matching may be unavoidable. When the concept is not present in the vocabulary at all, the vocabulary can be extended by hand (e.g. create a new skos:Concept for “cave painting” and link it to an AAT concept with skos:broader).

Proc. Int’l Conf. on Dublin Core and Metadata Applications 2010

6. Best Practices in Conversion for Semantic Search Based on our experiences we can formulate some best practices for data conversion that enables semantic search:

There are a few principles from the Semantic Web community in which we operate that we found useful in guiding our choices: the usage of URIs to identify and link concepts, the usage of concept hierarchies (SKOS), specialization/generalization relationships between object attributes or “properties” (equivalent to the “dumb down” principle).

Firstly, make as much of the implicit connections between collections explicit by using URIs instead of literal values. For this goal we also converted vocabularies such as the Getty Institute’s Universal List of Artist Names (ULAN), and mapped URIs for e.g. “Van Gogh” in individual collections to the ULAN URI for Van Gogh. The potential of this principle to enable semantic search can be illustrated with the following example: a keyword search for “Impressionism” will find works by Van Gogh, even if the records have not been annotated with Impressionism. Semantic search can find the records because another vocabulary states that Van Gogh was a (Post-)Impressionist. In this paper we describe other examples of semantic search that benefit from using URIs instead of literals, such as clustering of search results.

The second principle is to standardize the metadata schema, while allowing specializations. All records that are converted should use the standard as much as possible. Information that does not fit the standard directly is stored in a specialization of the standard. This allows the application to work on one coherent metadata schema, without losing information that is specific to a collection. For the standard metadata element set we chose the VRA Core 3.0, a specialization of Dublin Core aimed at visual art. The converted collections can thus be used on three levels, where each next level is “dumbed down” from the previous: (1) collection-specific level; (2) visual art level (VRA Core); (3) generic metadata level (Dublin Core). We show that each specific level provides a little more functionality than the former, but that each level contributes to the demonstrator’s capabilities. (The newer VRA Core 4.0 was released as a beta in late 2005, definitively in 2007. The changes from 3.0 were the following: three of the elements were split them into new elements, one element was renamed, two new “main” elements were added (VRA, 2007)) Nine VRA elements can be mapped to a more specific DC element, e.g. vra:material to dc:medium. We provide this mapping at … Missing DC element for physical location? Mapping vra:location to dc:coverage not correct. Usage guidelines: no real restrictions on allowed records; use URIs as much as possible Syntax Guidelines and Data Formats: RDF(S)/OWL, SKOS-like vocabularies

2010 Proc. Int’l Conf. on Dublin Core and Metadata Applications

From these functionalities we can derive requirements on the representation of the metadata and the metadata element set used in that representation:

• The implicit graph structure formed by records, values and vocabularies should be made as explicit as possible; i.e. metadata element values should be URIs when possible, not literal values;

• Metadata about the work should be separated from metadata about the image; • All information on time and geographic location in the records should be converted and

accessible in the RDF repository; • Domain-specific elements that allow users to do faceted browsing should be present in

the element set. The elements should allow coverage of typical user queries through the facets (each element corresponds to a possible facet). Typically users are interested in queries that relate to Who, What, Where, When, Why and How;

• Ability to indicate collection-specific value ranges in the metadata schema. The first requirement suggests that the range definitions of the schema should not assume either literal or URI values. We need to deal with this in our RDF representation of VRA Core (see next section). The second requirement is addressed by VRA Core’s separation of resources into Works and Images. The third requirement is covered by vra:date and vra:location. The fourth requirement can be fulfilled by VRA Core except for the Why. Fortunately, this type of data is also not present in the actual data, except in literal descriptions. The fifth requirement will be addressed in Section […]

Who creator What object type, subject, material When date Why ... description How technique

Table X: relationship between typical UI facets and VRA Core properties.

[information on VRA Core spec, table of its elements and meaning, choices made in conversion; the added link between Works and Images]

6. Collection-Specific Value Constraints [add something about not constraining range to either URI or literal; cannot control what form we will get the RDF, not always in control!] The VRA specification recommends vocabularies as range for some elements, for example the Art and Architecture Thesaurus (AAT) for Material. We do not encode these recommendations, because different collections may use different vocabularies. In general, we think such restrictions are not appropriate at the level of an AP. However, they are appropriate and useful at the level of collections. We call this feature collection-specific value ranges. For example, the

Proc. Int’l Conf. on Dublin Core and Metadata Applications 2010

Dutch the Rijksmuseum uses AAT as the source for object type, while the National Ethnographic Museum uses SVCN (a vocabulary derived from AAT).

Collection Element Value range Rijksmuseum … … RMV

Table X: A list of elements from particular collections that have a value range. Sometimes only a part of a vocabulary is relevant. For example, the objects in a museum of modern art will have values for the style/period element that are located below AAT’s modern European styles and movements concept. Other parts of the Styles and Periods facet are not relevant. Indicating vocabularies used helps in annotation (the appropriate part of the vocabulary can be shown to annotators) and in checking data for correctness (values outside the specified vocabulary part are likely erroneous). We repeat that this is feature does not strictly belong to the AP, but it should be specified by each application that uses the VRA Core AP. The RDF encoding for the E-Culture demonstrator can be found at: http://... An example is given in Figure X that shows how we can indicate which part of a vocabulary is relevant.

vp:parent a owl:TransitiveProperty . ex:MyWorkClass a owl:Class ;

rdfs:subClassOf vra:Work ; rdfs:subClassOf

[ a owl:Restriction ; owl:onProperty vra:material ; owl:allValuesFrom

[ a owl:Restriction ; owl:onProperty vp:parent; owl:hasValue aat:material ]

] .

Figure X: Example of a collection-specific value range for the property vra:material. A restriction is added to the collection-specific class ex:MyWorkClass that states that only values below aat:material are allowed. This only

works if the hierarchical relation (in this case vp:parent) is transitive.

6.2 Representing Collection-Specific Value Ranges

vp:parent a owl:TransitiveProperty . ex:MyWorkClass a owl:Class ;

rdfs:subClassOf vra:Work ; rdfs:subClassOf

[ a owl:Restriction ; owl:onProperty vra:material ; owl:allValuesFrom

[ a owl:Restriction ;

2010 Proc. Int’l Conf. on Dublin Core and Metadata Applications

owl:onProperty vp:parent; owl:hasValue aat:material ]

] .

Figure X: Example of a collection-specific value range for the property vra:material. A restriction is added to the collection-specific class ex:MyWorkClass that states that only values below aat:material are allowed. This only

works if the hierarchical relation (in this case vp:parent) is transitive.

OWL Pattern for collection-specific value ranges. USE THIS PATTERN TO RESTRICT THE VALUES OF A METADATA ELEMENT TO PART OF A VOCABULARY. THE PATTERN IS APPLICABLE WHEN THE VOCABULARY IS DEFINED AS A HIERARCHY BETWEEN INSTANCES. THE ELEMENT IS REPRESENTED BY PROPERTY P, THE HIERARCHICAL PROPERTY BY H, THE HIERARCHY PART HAS CONCEPT C AS MOST GENERIC CONCEPT.

• create a class W that will act as placeholder for collection-specific value ranges. • define a restriction on P that states that all values are allowed that are related by property H to concept C; • make W a subclass of this restriction; • define that H is transitive;

This pattern happens to be almost identical to one discussed in W3C’s “Classes as values” note (SemanticWeb Best Practices and DeploymentWorking Group 2005c), as “Approach 1”. The note addresses the situation where one wishes to restrict the values of a property to a particular class and its subclasses. In this situation almost the same pattern can be applied. The main difference is that the second restriction is defined on the hierarchical property between classes (rdfs:subClassOf) instead of on a hierarchical property between instances (e.g. skos:broader). [say something about differing semantics of OWL and the DC constraint language (although for the last no formal semantics available)]

MATERIAL Qualifiers: Material.Medium Material.Support Description: The substance of which a work or an image is composed. Data Values (controlled): AAT VRA Core 2.0: W5 Technique CDWA: Materials and Techniques-Processes or Techniques- Name Dublin Core: FORMAT

Figure R. Original description of VRA element “Material” in the VRA Core specification. Mappings to elements of

three other element sets are given: VRA Core 2.0, Dublin Core and CDWA.

Proc. Int’l Conf. on Dublin Core and Metadata Applications 2010