
Is my:sameAs the same as your:sameAs? Lenticular Lenses for Context-Specific Identity

Al Koudous Idrissou¹,², Rinke Hoekstra¹, Frank van Harmelen¹, Ali Khalili¹, Peter van den Besselaar²

¹ Department of Computer Science, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands

² Department of Organization Sciences, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands

ABSTRACT

Linking between entities in different datasets is a crucial element of the Semantic Web architecture, since those links allow us to integrate datasets without having to agree on a uniform vocabulary. However, it is widely acknowledged that the owl:sameAs construct is too blunt a tool for this purpose. It entails full equality between two resources independent of context. But whether or not two resources should be considered equal depends not only on their intrinsic properties, but also on the purpose or task for which the resources are used. We present a system for constructing context-specific equality links. In a first step, our system generates a set of probable links between two given datasets. These potential links are decorated with rich metadata describing how, why, when and by whom they were generated. In a second step, a user then selects the links which are suited for the current task and context, constructing a context-specific “Lenticular Lens”. Such lenses can be combined using operators such as union, intersection, difference and composition. We illustrate and validate our approach with a realistic application that supports researchers in social science.

CCS CONCEPTS

• Computing methodologies → Knowledge representation and reasoning;

KEYWORDS

owl:sameAs, linkset, lens, data integration

1 INTRODUCTION

Constructing and maintaining links between corresponding entities in different datasets and ontologies is a crucial element of the Semantic Web architecture. After all, these links are responsible for the integration of datasets and ontologies published by multiple independent parties without the requirement to agree a priori on a uniform vocabulary, allowing the Semantic Web to scale.

Knowledge Capture Conference, December 4–6, 2017, Austin, TX, USA. © 2017 Association for Computing Machinery. ACM ISBN 978-1-4503-5553-7/17/12. https://doi.org/10.1145/3148011.3148029

The quality of such correspondence links depends on how well the requirements for the correspondence fit the formal semantics of the predicate used to express it. Linking across datasets for information retrieval purposes poses different constraints than use cases where statistical analysis or automated decision making play a role. Similarly, a false positive skos:related relation has less far-reaching consequences than a misguided :sameAs. As long as the data is used in isolation, and the links are well understood, this problem is manageable. However, when data and links are shared on the Web, the context in which a link was deemed to be true is lost. This is especially problematic for transitive predicates that formally state equality between resources. For instance, [9] shows that :sameAs is often misused. As you follow a chain of :sameAs triples, the stated equivalence between the two outer ends of the chain becomes increasingly tenuous.¹ At the same time, [3] argues that the use of weaker alternatives such as rdfs:seeAlso reduces the utility of the relation to express the intended semantics. It would seem that different linktypes are required in different circumstances. Or rather, that the truth value of an equality assertion depends on the context for which it was generated.

We can illustrate the problems with the semantics of :sameAs with an example from a scientific domain that performs analysis across multiple datasets: the field of Science, Technology and Innovation (STI) studies the dynamics of scientific ideas [8]. For this, the field depends on a large variety of heterogeneous data.

To study the success of scientific organizations, STI researchers need to align such organizations across datasets that describe organisations across various countries, e.g. GRID² and OrgRef³. The 3M corporation, a large multinational organisation with a substantial patent portfolio, occurs in both datasets. GRID distinguishes between national 3M branches across six countries, while OrgRef only refers to a single 3M entity. Should these entities be designated as “the same” across these datasets? It depends. For a study that aims to compare organizations at a global level, they should; for a comparison across countries, they should not.

Similar issues arise in other areas: in biochemistry, users cannot uniformly decide in what context two entries about a molecule in different datasets are the same [5]; in pharmacology, should drugs be considered equal if they have the same brand name, the same chemical structure, or meet some other criterion? In the humanities, should two imprints of the same manuscript be considered equal (same text, same author) or different (different printing edition)? In all cases, the choice depends upon the application at hand.

¹ This is one of the reasons that the SKOS matching relations are not transitive.

² See https://grid.ac/

³ See http://www.orgref.org/web/download.htm

The core question that we address in this paper is: how can we facilitate users in constructing, sharing and using context-sensitive equality relationships between Semantic Web datasets?

We extend earlier work on linksets [2], link scripts [18] and scientific lenses [5] by making explicit both the links and the methods for constructing those links as well as the justifications for them. We can then manipulate such explicit representations (query them, combine them, validate them) and in this way construct different equality relations depending on context.

Our approach consists of two steps: we first generate (multiple sets of) candidate correspondences (the linksets), and we explicitly represent such candidate sets, including the reasons why they were generated as candidates and how they were generated (and by whom, and when, etc.). We can then use this rich metadata to select candidate links and combine candidate sets into context-specific lenticular lenses, which then serve as a context-specific equality relation for a view over the integrated data.⁴

This paper is structured as follows. We discuss related work in section 2, give formal definitions for our approach in section 3 and provide an RDF model and implementation in section 4. Section 5 introduces use cases from the STI domain, and evaluates how lenticular lenses help users to answer research questions that rely on alternative mappings across datasets. Section 6 concludes.

2 RELATED WORK

Our discussion of related work concerns three aspects: basic terminology, methods for identifying and generating equality relations between entities across datasets, and methods for expressing the correspondences and their metadata in a way that allows us to select the right links on demand without introducing inconsistencies.

Basic terminology. We briefly clarify some of the terminology used, and then discuss how the related work compares to the approach presented in this paper. A relation between two entities originating from different datasets is either called an RDF link [2] or, more specifically, a correspondence triple [7] when the link expresses an equality relation. An alignment between two datasets is called a linkset [2] when the correspondence triples that make up the alignment all use the same link predicate. This predicate is called the linktype of the linkset, e.g. owl:sameAs or skos:exactMatch.

Identifying and generating correspondences. In [4], a theory is proposed for a contextualised owl:sameAs semantics that bases equality explicitly on the commitment to a set of identity criteria (property-value pairs) that resources have in common, and that set can be varied depending on context. Our approach generalises [4] as we allow for other methods to construct context-sensitive equality relations beyond the simple intersection of equal property-value pairs. Furthermore, our approach allows different equality criteria to be established between different pairs of datasets, while the theory in [4] assumes a single universe across which the equality criteria apply globally (details in section 3).

To find equality relations between resources, we can make use of several tools that fall into two categories: those operating as a ‘black box’ (e.g. using machine learning or other heuristics) and those that rely on users for guiding the mapping process. Examples of the former are AGDISTIS [16], LogMap [14], OtO [6], and other string or graph-based similarity algorithms. These systems are intended to be generic in the sense that they do not take context or domain specific considerations into account. This makes them less suitable for our purposes: as argued above, the appropriateness of a linkset is most often determined by contextual factors.

⁴ “A Lenticular Lens is an array of magnifying lenses, designed so that when viewed from slightly different angles, different images are magnified.”, from http://en.wikipedia.org/wiki/Lenticular_lens.

Tools of the user-guided type seem more promising: they generate mapping triples based on user defined rules that serve as explicit representations of the context-specific identity criteria. For example, Linkage Query Writer (LQW) [11, 12] uses requests expressed in the LinQL language to discover a variety of relations between resources in one or more datasets. Similarly, SILK [18] is driven by a link specification language, SILK-LSL, that can be used to express the context-specific user-defined conditions under which two resources are linked. LQW and SILK-LSL support syntactic, semantic and hybrid similarity metrics. The SILK workbench UI allows users to inspect the confidence value of the links it discovers. Unfortunately, these tools do not record the justifications of their context-specific mappings in a declarative form, e.g. as RDF metadata. As a result, it is hard to decide if a particular context-specific mapping can be re-used for another purpose. The only justification of the mappings from those tools is the implicit encoding of the context in the form of the mapping rules.

Amalgame [17] is an interactive tool primarily focused on the alignment of SKOS vocabularies.⁵ It can produce large provenance records that capture how correspondences were created, using a combination of the PROV-O vocabulary and reïfied RDF triples (statements).⁶ However, such a reïfied encoding is not a suitable representation for on-demand context-specific selection of candidate linksets [15].

A more pragmatic encoding is used in the OpenPHACTS project⁷, which provides access to integrated biomedical Linked Data. Because users have different requirements for ‘sameness’, this access is through so-called ‘Scientific Lenses’ [5] that enable or disable specific linksets. In OpenPHACTS, these linksets are represented as named graphs containing pairs of resources that are linked using a single predicate. To facilitate the selection of lenses, they are expressed as RDF using an extension of the VOID vocabulary for datasets⁸ that includes a reference to the justification for the linkset, i.e. the property on the basis of which the two resources were linked. Thus, the OpenPHACTS approach improves over SILK in having a declarative representation and improves over Amalgame by having a practically useful encoding. We therefore choose to build on the OpenPHACTS approach.

However, OpenPHACTS scientific lenses are limited in three respects. First, the model only allows metadata (such as confidence) to be expressed at the level of the lens or linkset as a whole (the named graph). Second, a lens can only combine multiple linksets by taking the union of such linksets. To overcome the first limitation, our approach expresses metadata at both the graph and the triple level. This allows for more fine-grained reuse of candidate links from a linkset. To overcome the second limitation, we define more complex operations over lenses, allowing for union, intersection, difference, and composition (see sections 3 and 4). The third limitation of scientific lenses in OpenPHACTS is that the correspondences are expressed using the linking predicate (e.g. owl:sameAs) directly. As a result, multiple competing lenses potentially introduce inconsistencies into a knowledge base. To avoid this, we introduce a unique linking predicate per individual correspondence.

⁵ See http://semanticweb.cs.vu.nl/amalgame.

⁶ See http://www.w3.org/TR/prov-o

⁷ See http://openphacts.org.

⁸ See https://www.w3.org/TR/void/

Encoding correspondences. We must ensure that we can decorate each individual correspondence (so that they can be selected and re-used), and ensure that the semantics of alternative linksets do not interact. In [7], it is suggested to use n-ary relations to express correspondences when multiple alternative matches exist at the same time. Essentially this pattern is similar to RDF reïfication, which introduces a new resource of type rdf:Statement and three predicates to identify the rdf:subject, rdf:predicate, and rdf:object of the reïfied triple. The idea is then to associate the metadata to the newly added statement. Unfortunately, the RDF semantics does not allow us to infer the existence of the reïfied triple from the existence of an rdf:Statement. Furthermore, both reïfication and n-ary relations not only introduce an overhead of at least four new triples, but also make querying more difficult.

Alternatively, one can use a named graph for each triple. However, this has the same effect on query complexity, and has an additional negative effect because triple store indices are typically optimized for fewer named graphs. One way to address this shortcoming is to use an in-line syntax for expressing reïfied triples in data and queries ([10], see Listing 1). However, this requires RDF* and SPARQL*.

    <<graph1:VUA owl:sameAs graph2:Vrije_Universiteit_Amsterdam>>
        dct:creator <http://example.com/agdistis> ;
        dct:created "2016-04-29T06:15:02Z"^^xsd:dateTime .

Listing 1: In-line reification as statement-level metadata

Nguyen et al. [15] propose ‘singleton properties’, where the predicate of every individual triple (e.g. in a linkset) is uniquely identifiable, and is an ‘rdf:singletonPropertyOf’ a more generic property. This allows us to assign metadata for each candidate link to its singleton property, while at the same time, the semantics of these links do not interact across linksets. As with reïfication, query complexity and performance are affected. But experiments suggest that singleton properties are less affected than the other approaches [13].

This section discussed entity matching tools, the use of linksets and lenses to allow for different views over data, and ways to assign metadata to named graphs or individual triples. We conclude that there is no integrated framework that (a) allows user-defined methods for generating correspondences, (b) allows context-specific selections of correspondences, and (c) allows rich operations over linksets and lenses.

3 DEFINING LENTICULAR LENSES

In this section we define the notions of contextualised equality and the supporting notions of linkset and lens, before presenting in the next sections an RDF model and an implementation for these.

The semantics of the OWL constructions for equality (owl:sameAs and owl:sameIndividual) adheres to Leibniz’ Law on identity of indiscernibles, which in quasi-RDF notation would read:

$$\forall p : x = y \leftrightarrow (\langle x, p, v \rangle \leftrightarrow \langle y, p, v \rangle) \tag{1}$$

where ⟨·, ·, ·⟩ denotes a triple. This bi-implication captures two separate principles, each of which is too strong for useful deployment on the Semantic Web. The → direction captures the indiscernibility of identicals (identical objects share all their properties). Even the simple examples from our Introduction make clear that this is too strong, due to the quantification over all possible properties p, whereas in practice, the set of predicates that is used to determine equality differs between contexts. This is the problem of context independence. The ← direction captures the identity of indiscernibles. This is too strong for use on the Semantic Web because of the open world assumption. Again, this could be repaired by restricting the set of predicates p to a finite set, such as those predicates occurring in some knowledge graph, or some ontology.

Context-sensitive indiscernibility. The unsuitability of the Leibniz principle for the Semantic Web was also noted by others. [4] defines a context as the set of predicates Π which are necessary and sufficient to determine indiscernibility and hence equality:

$$\forall p \in \Pi : x =_{\Pi} y \leftrightarrow (\langle x, p, v \rangle \leftrightarrow \langle y, p, v \rangle) \tag{2}$$

Because of the restriction to predicates in Π, → is now context sensitive, and ← is limited to a closed world. However, [4] is unclear about the treatment of properties p ∉ Π. Should these properties propagate over equality, i.e.

$$\forall p \notin \Pi : (x =_{\Pi} y \wedge \langle x, p, v \rangle \rightarrow \langle y, p, v \rangle) \tag{3}$$

or not? A simple example shows that the answer again differs per context. Drugbank [19] publishes data about more than 8000 drugs with over 200 properties per drug. Among these are the chemical structure of the drug, its biological targets, and the brand names under which it is sold. In a pharmaceutical setting, two drugs are equal if they have the same chemical structure, hence the context-defining property would be: Π = {structure}. But the two properties target and brand, both not in Π, should behave differently with respect to (3). Two drugs with the same structure do address the same targets, hence (3) should apply to target; but two drugs with the same structure do not necessarily have the same brand name, hence (3) should not apply to brand.

For these reasons, we propose a richer definition of context, which recognises that both structure and target are important in the pharmacological context (as opposed to the pharmacologically irrelevant property brand), but that they play different epistemological roles: structure determines the indiscernibility equivalence classes, and inside those classes the value of target propagates, but it does not determine indiscernibility.

Definition of context. A context is defined by two sets of properties, Π and Ψ: Π for indiscernibility, and Ψ for propagation:

$$\forall p \in \Pi : x =_{(\Pi,\Psi)} y \leftrightarrow (\langle x, p, v \rangle \leftrightarrow \langle y, p, v \rangle) \tag{4}$$

$$\forall p \in \Psi : x =_{(\Pi,\Psi)} y \rightarrow (\langle x, p, v \rangle \leftrightarrow \langle y, p, v \rangle) \tag{5}$$

$$\forall p \notin \Pi \cup \Psi : \text{values of } p \text{ remain unchanged}$$

where (4) is the definition of contextualised equality, and (5) defines contextualised propagation. This allows for drugs that are indiscernible in the pharmacological context (Π = {structure}, Ψ = {structure, target}) to still have different values for properties outside Π ∪ Ψ, such as brand.
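As an informal illustration of definitions (4) and (5), the following Python sketch evaluates contextualised equality and propagation over a small in-memory set of triples. It is only a sketch under stated assumptions: the triples, the entity names and the function names (`equal_in_context`, `propagate`) are hypothetical and not part of the system described in this paper, and equality on Π is read as "both entities have exactly the same set of values for every property in Π".

```python
from itertools import combinations

# Toy triples (subject, property, value); all names and values are made up.
TRIPLES = {
    ("drugA", "structure", "struct-1"),
    ("drugA", "target", "target-1"),
    ("drugA", "brand", "BrandOne"),
    ("drugB", "structure", "struct-1"),
    ("drugB", "brand", "BrandTwo"),
    ("drugC", "structure", "struct-2"),
    ("drugC", "brand", "BrandThree"),
}


def values(triples, subject, prop):
    """Return the set of values that `subject` has for `prop`."""
    return {v for s, p, v in triples if s == subject and p == prop}


def equal_in_context(triples, x, y, pi):
    """Definition (4): x and y share all values of every property in pi."""
    return all(values(triples, x, p) == values(triples, y, p) for p in pi)


def propagate(triples, pi, psi):
    """Definition (5): share values of properties in psi between equal entities.

    Properties outside pi and psi are copied over unchanged.
    """
    result = set(triples)
    subjects = sorted({s for s, _, _ in triples})
    for x, y in combinations(subjects, 2):
        if equal_in_context(triples, x, y, pi):
            for p in psi:
                shared = values(triples, x, p) | values(triples, y, p)
                result |= {(x, p, v) for v in shared}
                result |= {(y, p, v) for v in shared}
    return result


closed = propagate(TRIPLES, pi={"structure"}, psi={"structure", "target"})
```

With Π = {structure} and Ψ = {structure, target}, `closed` contains the target of `drugA` for `drugB` as well, while the brand names of the two drugs stay separate.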


Across multiple datasets. All the above assumes that all properties p, entities x and values v live in a single namespace. In practice, owl:sameAs links are often used to create links between multiple datasets, with different namespaces. So simply stating ⟨x, p, v⟩ ↔ ⟨y, p, v⟩ is not enough if both sides live in different namespaces. We extend the definition above to take this into account. We use the symbol ≈ to indicate an alignment between two terms from different datasets. We choose the "approximate" symbol because such alignments are often indeed approximate string-matching (e.g. string-matching "New York" with "New_York"), but can also be more elaborate (e.g. dictionary-based matching "Den Haag" with "’s Gravenhage"), or linguistic (e.g. translating "pneumonia" to "longontsteking"). Now a context consists not only of sets of predicates for indiscernibility and propagation, but also of an alignment procedure, i.e. a context is now (Π, Ψ, ≈). Then the definition of contextualised equality (4) becomes

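$$x =_{(\Pi,\Psi,\approx)} y \leftrightarrow \forall p_1, p_2 \in \Pi \text{ with } p_1 \approx p_2 \text{ and } \forall v_1, v_2 \text{ with } v_1 \approx v_2 : \langle x, p_1, v_1 \rangle \leftrightarrow \langle y, p_2, v_2 \rangle \tag{6}$$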

and similarly for the definition of contextualised propagation (5). We now have everything in place to define our central notions of linkset and lenticular lens:

Linkset. Given two datasets D1 and D2 and a context (Π, Ψ, ≈), a linkset L is the set of all pairs (x, y) from D1 × D2 that are indiscernible in that context: L = {(x, y) ∈ D1 × D2 | x =(Π,Ψ,≈) y}. In other words, a linkset is the set of all context-specific correspondences between two datasets.
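The linkset definition can be sketched in a few lines of Python. The sketch assumes that the property alignment p1 ≈ p2 is given as explicit pairs and that the value alignment is a simple normalisation (case-insensitive, underscores read as spaces); both choices, as well as the dataset contents and the names `normalise`, `aligned_values` and `linkset`, are illustrative assumptions rather than the alignment procedures used in our implementation.

```python
def normalise(term):
    """Hypothetical value alignment: ignore case, treat '_' as a space."""
    return term.replace("_", " ").strip().lower()


def aligned_values(dataset, subject, prop):
    """Normalised values that `subject` has for `prop` in `dataset`."""
    return {normalise(v) for s, p, v in dataset if s == subject and p == prop}


def linkset(d1, d2, pi_pairs):
    """All (x, y) in D1 x D2 that agree on every aligned property pair.

    `pi_pairs` lists the aligned properties (p1, p2), with p1 from D1
    and p2 from D2.
    """
    xs = sorted({s for s, _, _ in d1})
    ys = sorted({s for s, _, _ in d2})
    return {
        (x, y)
        for x in xs
        for y in ys
        if all(
            aligned_values(d1, x, p1) == aligned_values(d2, y, p2)
            for p1, p2 in pi_pairs
        )
    }


# Two toy datasets with different property names for the same notion.
D1 = {("d1:ny", "label", "New York"), ("d1:ams", "label", "Amsterdam")}
D2 = {("d2:NYC", "name", "New_York"), ("d2:UTR", "name", "Utrecht")}

L = linkset(D1, D2, pi_pairs=[("label", "name")])
```

Here `L` contains the single pair `("d1:ny", "d2:NYC")`, because "New York" and "New_York" are aligned by the assumed normalisation.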

Lenticular Lens. A lenticular lens is also a set of context-specific correspondences between two datasets, but is constructed from linksets through union, intersection, difference and composition:

$$(x, y) \in \bigcup_i L_i \leftrightarrow \exists i : (x, y) \in L_i$$

$$(x, y) \in \bigcap_i L_i \leftrightarrow \forall i : (x, y) \in L_i$$

$$(x, y) \in L_a - L_b \leftrightarrow (x, y) \in L_a \wedge (x, y) \notin L_b$$

$$(x, y) \in L_a \circ L_b \leftrightarrow \exists z : (x, z) \in L_a \wedge (z, y) \in L_b$$
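When linksets are viewed simply as sets of pairs, the four operators correspond directly to standard set operations and relational composition. The following Python sketch shows this reading; the function names and the example pairs are illustrative assumptions, and the sketch deliberately ignores singleton properties and link direction, which are treated in section 4.3.

```python
def union(*linksets):
    """Pairs that occur in at least one linkset."""
    return set().union(*linksets)


def intersection(first, *rest):
    """Pairs that occur in every linkset."""
    return set(first).intersection(*rest)


def difference(la, lb):
    """Pairs in la that do not occur in lb."""
    return set(la) - set(lb)


def compose(la, lb):
    """Pairs (x, y) such that (x, z) is in la and (z, y) is in lb for some z."""
    return {(x, y) for x, z in la for z2, y in lb if z == z2}


# Hypothetical linksets between datasets a-b and b-c.
LA = {("a:1", "b:1"), ("a:2", "b:2")}
LB = {("a:2", "b:2"), ("a:3", "b:3")}
LC = {("b:1", "c:1"), ("b:2", "c:9")}
```

For instance, `compose(LA, LC)` links `a:1` to `c:1` and `a:2` to `c:9` by way of the shared entities in the middle dataset.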

4 RDF MODEL AND IMPLEMENTATION

This section presents an RDF model for Lenticular Lenses that allows us to generate, select and combine alignments in a context-specific manner, aiming to maximise potential for future reuse. To achieve this, we combine metadata about individual correspondences using singleton properties with metadata about an entire linkset. The singleton properties are organised in a property hierarchy that allows us to capture context-specific notions of equality at different levels: moving hierarchically from linktypes for very specific tasks to linktypes for more generic tasks, and vice versa; and finding a common shared notion of indiscernibility whenever different linksets or lenses are combined for a specific task.

Example. We illustrate our model (see Figure 1) with a simple example from the STI domain. We have three linksets, R, B, and Y, that each use different identity criteria to specify correspondences between two datasets: ETER and GRID. To generate R, the entities from both datasets (research organisations) were aligned using their resource identifier. For B, we applied an edit distance algorithm over organisation names, and only included pairs with similarity measure θ > 0.8. Finally, Y refines B by further demanding that organisations must have the same country as well as θ > 0.8.

4.1 Representing Linksets

Each correspondence in a linkset is linked through a singleton property with a triple-specific linktype. E.g. for :linkset_B, two correspondences are linked through my:sameAs_B_1 and my:sameAs_B_2:

    :linkset_B {
        eter:UK0021 my:sameAs_B_1 grid:grid.18886.3f .
        eter:AT0059 my:sameAs_B_2 grid:grid.1468.33 .
    }

Listing 2: Singleton properties in a linkset B

We represent the metadata for individual correspondences in a separate named graph.⁹ This graph specifies the evidence and strength for each singleton property, and relates them to a shared task-specific property (my:sameAs_B) in the generic named graph. This latter graph serves as a basis for reproducibility.

    my:sameAs_B_1
        rdf:singletonPropertyOf my:sameAs_B ;
        ll:strength 0.875 ;
        ll:hasEvidence "[Institute of Cancer Research] was compared to [The Institute of Cancer Research]" .

    my:sameAs_B_2
        rdf:singletonPropertyOf my:sameAs_B ;
        ll:strength 0.818 ;
        ll:hasEvidence "[London Business school] was compared to [Lauder Business School]" .

Listing 3: Relating singletons to their task specific parent
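To make the ll:strength values in Listing 3 concrete, the following Python sketch computes a normalised edit-distance similarity. The normalisation is an assumption on our part for illustration: a case-insensitive Levenshtein distance divided by the length of the longer string, which happens to reproduce both values in Listing 3. The function names `levenshtein` and `similarity` are hypothetical and do not refer to the :edit_distance implementation used by the system.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ca
                    current[j - 1] + 1,      # insert cb
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalisation: 1 - distance / length of the longer string."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest


THRESHOLD = 0.8  # the threshold used for linkset B

s1 = similarity("Institute of Cancer Research", "The Institute of Cancer Research")
s2 = similarity("London Business school", "Lauder Business School")
```

Under this assumed normalisation both pairs pass the threshold of linkset B. The second pair illustrates why a purely string-based strength needs further context, such as the country criterion of linkset Y, before it is trusted as an identity link.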

A task-specific property (my:sameAs_B) is defined as an rdfs:subPropertyOf a domain specific linktype that corresponds to the domain knowledge used to determine the correspondence: :approxStrSim for R and B, plus :geoSim for Y. In this case, the domain knowledge of my:sameAs_B is expressed by relating the property to the :edit_distance algorithm defined elsewhere. These domain specific linktypes are defined as the ll:contextPropertyOf of link predicates commonly used in Linked Data, such as owl:sameAs. We additionally specify each of these predicates to be a sub property of ll:isRelatedTo. Figure 1 illustrates this hierarchy.

    my:sameAs_B rdfs:subPropertyOf :approxStrSim ;
        rdfs:comment "Align using organisation name with threshold > 0.8" ;
        ll:algorithm :edit_distance .

Listing 4: Task- and domain specific linktypes.

Figure 1: A meta-model for correspondences.

⁹ We omit the TriG notation for the metadata graph for :linkset_B.

Finally, the linkset is defined as a graph using an extension of the VOID and BDB vocabularies, as in [5]. We provide three additional predicates: the ll:mechanism predicate relates the linkset to the domain specific predicate; the ll:alignsSubjects and ll:alignsObjects predicates refine bdb:subjectsDatatype and bdb:objectsDatatype to indicate Π, the set of properties used to determine the correspondences. In other words, following [4] we make explicit the identity criteria for our correspondences:

    :linkset_B rdf:type void:Linkset ;
        void:linkPredicate my:sameAs_B ;
        void:triples 816 ;
        void:subjectsTarget :eter ;
        void:objectsTarget :grid ;
        ll:mechanism :approxStrSim ;
        ll:alignsSubjects eter:english_Institution_Name ;
        ll:alignsObjects grid:name ;
        bdb:linksetJustification :justification_linkset_B .

    :justification_linkset_B
        rdfs:comment "Similarity measured in the interval ]0.8 1[." .

Listing 5: Defining metadata for linkset B

4.2 Using Linksets in a Lens

We can use the above representation of linksets to construct a Lenticular Lens that reflects the use case specific identity criteria. In [5], Lens definitions are used to look up the linksets that should be ‘enabled’ for a specific SPARQL query: they are the named graph(s) that scope the query. Our approach differs in two important ways. First, in [5], enabling multiple linksets for a query makes all of their correspondences true for that query. In other words: the lens can only express a union of linksets. Second, our linksets are expressed using singleton properties, rather than the actual equality relation. We thus need to define how a lens combines linksets in more expressive ways, and how the lens can make the equality relation visible.

For example, consider the scenario where a user only wants to take into account correspondences that have a similar name (θ > 0.8), have the same resource identifier, and are in the same country. In Listing 6 we specify such a Lens as the intersection of the linksets R and Y from the example above (expressed using the ll:operator property).

    :lens_R_Y rdf:type bdb:Lens , ll:LenticularLens ;
        void:target :linkset_R , :linkset_Y ;
        void:triples 66 ;
        ll:expectedCorrespondences 33 ;
        ll:removedDuplicates 7 ;
        ll:operator :intersection ;
        bdb:assertionMethod :sparql_R_Y .

    :sparql_R_Y ll:sparql :listing_1.8 .

Listing 6: Defining metadata for Lens_R_Y.

As singleton properties do not carry semantics by themselves, we need to define how they will be used when answering queries over the datasets to which they apply. This is where the linktype hierarchy comes into play: linksets are merged by normalizing their linktypes to a shared ancestor in the property hierarchy. Using the specification in Listing 6 we generate a SPARQL query (Listing 7) that does just this: where the pairs of resources from linksets R and Y are the same, we insert an equality relation using the most specific common ancestor for the respective singleton properties. The query is in essence a template that can be applied to any combination of two linksets that are merged using the intersection operator.¹⁰ The resulting lens :lens_R_Y can now be used as in [5].

¹⁰ Note that the only ‘fixed’ resources are the URIs of the linkset and metadata graphs.

    PREFIX : <http://another.example/>
    INSERT {
      GRAPH :lens_R_Y { ?eter_resource ?ancestor ?grid_resource } }
    WHERE {
      GRAPH :linkset_R { ?eter_resource ?Sing_R ?grid_resource . }
      GRAPH :linkset_Y { ?eter_resource ?Sing_Y ?grid_resource . }
      GRAPH :metadata_R { ?Sing_R rdf:singletonPropertyOf ?parent_R . }
      GRAPH :metadata_Y { ?Sing_Y rdf:singletonPropertyOf ?parent_Y . }
      GRAPH :propertyHierarchy {
        # GET ALL COMMON ANCESTORS
        ?parent_R (rdfs:subPropertyOf|ll:contextPropertyOf)+ ?ancestor .
        ?parent_Y (rdfs:subPropertyOf|ll:contextPropertyOf)+ ?ancestor .
        # EXCLUDE COMMON ANCESTORS THAT HAVE A COMMON CHILD ANCESTOR
        FILTER NOT EXISTS {
          ?parent_R (rdfs:subPropertyOf|ll:contextPropertyOf)* ?cAncestor .
          ?parent_Y (rdfs:subPropertyOf|ll:contextPropertyOf)* ?cAncestor .
          ?cAncestor (rdfs:subPropertyOf|ll:contextPropertyOf) ?ancestor .
        } } }

Listing 7: SPARQL query for generating Lens_R_Y without metadata by implementing an intersection over linksets.
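The idea behind the ancestor selection in Listing 7 can also be sketched outside SPARQL. The Python sketch below finds the most specific common ancestors of two task-specific properties in a small property hierarchy. It is not a line-by-line translation of the query: it uses strict ancestors only, and the hierarchy shown (in particular the parents assigned to my:sameAs_R and my:sameAs_Y) is a hypothetical example assembled for illustration.

```python
# child -> set of parents, merging rdfs:subPropertyOf and ll:contextPropertyOf.
# The concrete hierarchy is an illustrative assumption.
PARENTS = {
    "my:sameAs_R": {":approxStrSim"},
    "my:sameAs_Y": {":approxStrSim", ":geoSim"},
    ":approxStrSim": {"owl:sameAs"},
    ":geoSim": {"owl:sameAs"},
    "owl:sameAs": {"ll:isRelatedTo"},
}


def ancestors(prop, parents):
    """All strict ancestors of `prop` (transitive closure of the parent links)."""
    seen, stack = set(), [prop]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def most_specific_common_ancestors(a, b, parents):
    """Common ancestors of a and b that have no other common ancestor below them."""
    common = ancestors(a, parents) & ancestors(b, parents)
    return {
        c
        for c in common
        if not any(c in ancestors(other, parents) for other in common if other != c)
    }


shared = most_specific_common_ancestors("my:sameAs_R", "my:sameAs_Y", PARENTS)
```

In this hypothetical hierarchy the two properties share :approxStrSim, owl:sameAs and ll:isRelatedTo as ancestors, and only :approxStrSim is kept, because the other two lie above it.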

Although the lens described in Listing 7 is very simple, if there is a need to combine multiple alignments, finding the common shared linktype becomes more challenging. We provide formal definitions of supported operators (union, intersection, difference and composition) in section 4.3.

4.3 Complex Operations over Linksets

A lens is a combination of linksets and/or other lenses using set-like operators such as union, intersection, difference and composition. Although these operators are implemented in SPARQL, they cannot be used off-the-shelf in our framework for three reasons. First, the equality relation is symmetric: the direction of a triple is irrelevant for the indiscernibility relationship: ⟨e1, r, e2⟩ iff ⟨e2, r, e1⟩ (where r is the linktype). Second, in the proposed approach, the identity is represented through a unique singleton property: two instances of the same identity between e1 and e2 will look like ⟨e2, r, e1⟩ and ⟨e2, r′, e1⟩ respectively, with r different from r′. Third, new singleton properties will have to be generated for the resulting lens. Other than these technicalities, the operators behave as their set-theoretic counterparts. These issues are addressed in three steps.

Algorithm 1: Algorithm for generating a Lens using the UNION operator. An alignment is composed of three graphs: main (the correspondences graph), generic (the generic metadata graph) and specific (the singleton graph).

    input: Set of alignments G (Linkset/Lens) of size n triples
    output: Lens
    Complexity: O(n log n) assuming an efficient search is used
    Note: n is the total number of triples while k (much smaller than n)
          is the number of duplicates for each iteration in n.

    begin
        genericGraph.add(generic metadata about the lens)
        for g ∈ G do                                        /* O(n) */
            tempGraph.add(g)
        for ⟨x, p, y⟩ ∈ tempGraph do                        /* O(n) */
            triples ← [⟨x, q, y⟩ | ⟨x, q, y⟩ ∈ tempGraph]   /* search of O(log n) */
            tempGraph ← tempGraph - triples
            preds ← [q | ⟨x, q, y⟩ ∈ triples]
            r ← singletonProperty(preds)
            if x <g y then
                mainGraph.add(⟨x, r, y⟩)
            else
                mainGraph.add(⟨y, r, x⟩)
            for ⟨x, q, y⟩ ∈ triples do                      /* O(k) */
                specificGraph.add(⟨r, prov:derivedFrom, q⟩)
        return [mainGraph, genericGraph, specificGraph]
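A compact way to read Algorithm 1 is as a grouping of triples by unordered pair, followed by the minting of one new singleton property per group. The Python sketch below follows that reading under several assumptions that are ours and not part of the algorithm as stated: it groups with a hash map instead of a search, it orders each pair by plain string comparison instead of by dataset name, and the naming scheme for new singleton properties (`lens:sameAs_1`, ...) is hypothetical.

```python
from collections import defaultdict


def union_lens(alignments, lens_name="lens"):
    """Build the main, generic and specific graphs of a union lens.

    `alignments` is an iterable of sets of (subject, singleton_predicate, object).
    """
    # Step 1: forget direction and predicate, group by unordered pair.
    grouped = defaultdict(set)
    for graph in alignments:
        for x, q, y in graph:
            pair = (x, y) if x <= y else (y, x)
            grouped[pair].add(q)

    generic_graph = {(lens_name, "ll:operator", ":union")}
    main_graph, specific_graph = set(), set()

    # Steps 2 and 3: one new singleton property per pair, consistent direction.
    for i, (pair, preds) in enumerate(sorted(grouped.items()), start=1):
        r = f"{lens_name}:sameAs_{i}"  # hypothetical naming scheme
        main_graph.add((pair[0], r, pair[1]))
        for q in preds:
            # Keep the provenance of every original singleton property.
            specific_graph.add((r, "prov:derivedFrom", q))

    return main_graph, generic_graph, specific_graph


# Toy linksets; the R identifiers are made up for illustration.
B = {("eter:UK0021", "my:sameAs_B_1", "grid:grid.18886.3f")}
R = {
    ("grid:grid.18886.3f", "my:sameAs_R_1", "eter:UK0021"),
    ("eter:AT0059", "my:sameAs_R_2", "grid:grid.1468.33"),
}

main, generic, specific = union_lens([B, R])
```

The correspondence between eter:UK0021 and grid:grid.18886.3f occurs in both inputs, in opposite directions, but yields a single triple in `main` with two provenance triples in `specific`.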


We first abstract from the individual singleton properties and their direction by collecting all unordered pairs {e1_i, e2_i} for which ⟨e1, r, e2⟩ ∈ L for some singleton property r. Second, we create a lens with new singleton properties ri for each pair that satisfies the conditions of the construction operator (intersection etc.). Finally, we choose a consistent direction ⟨e1, ri, e2⟩ for each pair, based on the alphabetical order of the names of the datasets in which they are described.

In summary, to generate a lens we first ignore the unique predicates and their direction. Then, after the set operation is performed, we reconstruct unique linking predicates for the resulting triples. Algorithm 1 illustrates this for the union operator, where the efficiency of the procedure comes down to using a fast search algorithm.

5 QUALITATIVE EVALUATION

As usual with methodological proposals, it is not obvious how to perform a quantitative evaluation of the proposed model. We therefore proceed with a qualitative evaluation through a series of increasingly complex case studies taken from a research project in the STI field to show that the framework has both the necessary and sufficient functionalities to fulfil the needs of these case studies. We first present two illustrative use-cases in subsection 5.1. In subsection 5.2, we present a more complicated realistic case and show how the model for lenticular lenses can fulfil the requirements of such a complicated yet realistic use-case. We also discuss time and space complexity in subsections 4.3 and 5.3.

To visualise the execution of the use-cases, we implemented a prototype user interface (http://lenticular-lens.risis.eu/) to ease the user interaction over the Lenticular Lens system that realises the theory and framework described above. It addresses data integration as a process where users connect data in small steps and enables them to analyse, refine and combine linking results at each step of their context-specific integration. Furthermore, with its ever growing library of linksets and lenses, the system allows users with similar interests to reuse or modify existing alignments.

5.1 Case study 1 & 2

Research question and Dataset selection. The first two inputs of the system are the context and the datasets of interest. Since the use-cases are taken from observing and supporting social scientists, the system associates a data-linking task to the context of a specific research question. Case studies 1 and 2 (both introduced in section 1) share the same goal of aligning GRID and OrgRef, but have different motivating research questions: “What are innovation strategies of multinational firms?” and “How does the Canadian industrial structure compare to the US structure?” respectively. Consequently, case study 1 needs to align firms using the company name, independently of their geographic locations, while case study 2 aligns them on the basis of matching both their name and their location.

Linkset creation. We first look at case study 1. Figure 2 shows that a context is defined in the scope of the research question by selecting the datasets (GRID and OrgRef), the entity types of interest (Institutions, Organisation), the properties to be used in the matching (name for GRID and Name for OrgRef, i.e. the set Π), and a matching method (approximate string matching, i.e. the procedure ≈).¹¹ These four inputs together result in a linkset of potential correspondences.

Figure 2: Linkset creation

The current implementation supports six matching mechanisms: exact and approximate string, URI identity, embedded (links present in the source dataset), intermediate (use of an intermediate data source for similarity), and geo-similarity, but this set is intentionally open-ended and can be easily extended.
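As a rough illustration of how the four inputs lead to a decorated linkset, the sketch below matches two small in-memory datasets on their name properties using approximate string matching. The similarity measure (Python's `difflib.SequenceMatcher`), the dictionary layout, the metadata keys and the sample records are all assumptions of this example, not the measure or the metadata model of the actual system.

```python
from difflib import SequenceMatcher


def create_linkset(source, target, threshold=0.85):
    """Naive approximate string matching between two datasets.

    source/target: dict mapping a resource URI to its name property.
    Each correspondence carries illustrative (hypothetical) metadata keys.
    """
    linkset = []
    for s_uri, s_name in source.items():
        for t_uri, t_name in target.items():
            score = SequenceMatcher(None, s_name.lower(), t_name.lower()).ratio()
            if score >= threshold:
                linkset.append({
                    "subject": s_uri,
                    "object": t_uri,
                    "mechanism": "approximate string",
                    "similarity": score,
                    "status": None,  # not yet accepted or rejected
                })
    return linkset


grid = {"grid:1": "3M Company", "grid:2": "Vrije Universiteit Amsterdam"}
orgref = {"orgref:9": "3M Company", "orgref:7": "Philips"}
candidate_links = create_linkset(grid, orgref)
```

Every pair of entries is compared here, which corresponds to the naive O(|X||Y|) strategy discussed in subsection 5.3.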

For case study 2, redefining a linkset from scratch is not needed as the linkset from case study 1 can be re-used and intersected with an additional matching mechanism: geo-similarity (for ≈) applied to the properties {country, Country} of GRID and OrgRef respectively (for Π). Such explicit re-use of previously constructed alignments, including a full provenance trail, is an essential feature of our approach which distinguishes it from others in the literature.
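The re-use described for case study 2 amounts to an intersection over unordered pairs. The following sketch shows the idea on tuples; the function names and the sample links are hypothetical, and the minting of new singleton properties for the resulting lens is deliberately left out.

```python
def unordered_pairs(linkset):
    """Drop predicate and direction: each link becomes a sorted (a, b) pair."""
    return {tuple(sorted((s, o))) for s, _, o in linkset}


def intersect(linkset_a, linkset_b):
    """Pairs that are linked in both linksets, regardless of direction."""
    return unordered_pairs(linkset_a) & unordered_pairs(linkset_b)


# Linkset of case study 1 (name similarity), illustrative data only.
name_links = {
    ("grid:1", "ll:name_1", "orgref:9"),
    ("grid:2", "ll:name_2", "orgref:7"),
}
# Additional linkset based on geo-similarity of the country properties.
geo_links = {
    ("orgref:9", "ll:geo_1", "grid:1"),  # same pair, opposite direction
    ("grid:3", "ll:geo_2", "orgref:5"),
}
name_and_location = intersect(name_links, geo_links)
```

Because the comparison ignores both the singleton predicate and the direction, the pair linked in opposite directions in the two linksets is still recognised as common.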

Selecting and approving correspondences. After creating these linksets, the user can manually refine them. To illustrate this step, we filter the linkset of study 1 based on correspondences that have a name similarity greater than 0.85 and the token “3M” in the name of the subject or object resource. Six resources from GRID are linked to the single occurrence of 3M in OrgRef. The user can now evaluate each correspondence by enriching it with more annotations, in particular by annotating them as accepted or rejected. Note that correspondences marked as rejected are not removed, but remain in the linkset to allow the linkset to be reused, and to provide a full provenance trail, again a crucial feature that sets our approach apart from others in the literature.
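The selection and validation step can be sketched as follows. The record layout, the field names and the sample correspondences are assumptions of this example; in the system itself this information lives in the singleton graph and is queried with SPARQL.

```python
def select(linkset, min_similarity, token):
    """Correspondences above the threshold whose subject or object name
    contains the given token."""
    return [
        c for c in linkset
        if c["similarity"] > min_similarity
        and (token in c["subject_name"] or token in c["object_name"])
    ]


def annotate(correspondence, status, validated_by):
    """Record a validation decision in place; nothing is ever deleted."""
    if status not in ("accepted", "rejected"):
        raise ValueError("status must be 'accepted' or 'rejected'")
    correspondence["status"] = status
    correspondence["validated_by"] = validated_by


linkset = [
    {"subject_name": "3M (United States)", "object_name": "3M",
     "similarity": 0.90, "status": None},
    {"subject_name": "3M Health Care", "object_name": "3M",
     "similarity": 0.88, "status": None},
    {"subject_name": "Philips", "object_name": "Philips Research",
     "similarity": 0.91, "status": None},
]
candidates = select(linkset, 0.85, "3M")
annotate(candidates[0], "accepted", "analyst-1")
annotate(candidates[1], "rejected", "analyst-1")
```

The rejected correspondence stays in `linkset` with its annotation, which is what keeps the linkset reusable in a different context.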

View and Final Integration. As the final step, the user creates a final Lens by selecting the set of correspondences to be used in the study. A View selects for all resources what properties (Ψ ⊃ Π) should be shown as part of a tabular representation presented to the user. In our case studies 1 and 2, the Lens selects all correspondences that have not been labelled as rejected.
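A View of this kind can be approximated by the sketch below, which keeps every correspondence not labelled as rejected and projects the selected properties of both linked resources into one table row. The data layout, the column naming (`uri.property`) and the sample values are assumptions made for illustration only.

```python
def build_view(lens, resources, properties):
    """One row per non-rejected correspondence, with the selected
    properties of the subject and object resources as columns."""
    rows = []
    for c in lens:
        if c.get("status") == "rejected":
            continue  # rejected links stay in the lens but not in the view
        row = {}
        for uri in (c["subject"], c["object"]):
            for prop in properties:
                if prop in resources.get(uri, {}):
                    row[uri + "." + prop] = resources[uri][prop]
        rows.append(row)
    return rows


lens = [
    {"subject": "grid:1", "object": "orgref:9", "status": "accepted"},
    {"subject": "grid:2", "object": "orgref:7", "status": "rejected"},
]
resources = {
    "grid:1": {"name": "3M", "country": "US"},
    "orgref:9": {"Name": "3M Company"},
}
view_rows = build_view(lens, resources, ["name", "Name", "country"])
```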

5.2 Case Study 3

The third case study¹² illustrates the need to combine multiple lenses and shows how the proposed model is rich enough to deal with this complex case. The aim is to investigate whether the university ranks in the LeidenRanking dataset can be predicted from the properties of the university and its environment in other datasets. For this, we need to align universities across five datasets: ETER (providing characteristics of European higher education institutions such as third party funding, total academic staff, total expenditure, students enrolled etc.), GRID (a worldwide collection of academic research institutes), LeidenRanking (performance metrics of over 800 major universities worldwide), GRID_Enr and ETER_Enr. The last two datasets are subsets of GRID and ETER enriched with geographic boundaries from GADM¹³ and a count of the number of worldwide organisations (GRID_Enr) and European universities (ETER_Enr). The dotted lines in Figure 3 show the complex mappings between these datasets that are needed to answer the research question. Because of this complexity, this use case is suitable to test the strength of our approach.

11 As a simplification, our current implementation on linkset always assumes Ψ = Π.
12 https://youtu.be/CcffBlCBF54?list=PLo4YbUaRFSnwJ9XJvp6rlIMsaw_rfKT9C

Linkset creation. Listing 8 gives a textual representation of what was shown in the user interface of Figure 2: research question, datasets, types of entities to be matched, properties to be aligned, and the matching mechanism to be used. ETER is directly aligned to three different datasets using different types, properties and mechanisms, resulting in Linkset1 through Linkset6.

ResearchQuestion
    Can we predict the CWTS scores from characteristics of the university?

Mapping (Dataset|EntityType)
    LeidenRanking|University
    GRID|Institution    GRID_GADM_enriched|Institution
    ETER|University     ETER_GADM_enriched|University

            Linkset1 specifications        Linkset2 specifications
-------------------------------------------------------------------
Dataset     {ETER, GRID}                   {ETER, GRID}
EntityType  {University, Institution}      {University, Institution}
IndisProp   {inst_name, name}              {eng_Inst_name, name}
Mechanism   Approximate Similarity         Approximate Similarity

            Linkset3 specifications        Linkset4 specifications
-------------------------------------------------------------------
Dataset     {ETER, LeidenRanking}          {ETER, LeidenRanking}
EntityType  {University, University}       {University, University}
IndisProp   {inst_name, university}        {eng_Inst_name, university}
Mechanism   Approximate Similarity         Approximate Similarity

            Linkset5 specifications        Linkset6 specifications
-------------------------------------------------------------------
Dataset     {ETER, ETER_GADM_Enr}          {GRID, GRID_GADM_Enr}
EntityType  {University, University}       {Institution, Institution}
IndisProp   {rsc_uri, rsc_uri}             {rsc_uri, rsc_uri}
Mechanism   Identifier Match               Identifier Match

Listing 8: (Π, Ψ, ≈) of case study 3

Lens creation. From the linksets generated in the previous step, we generate lenses Lens1 = Linkset1 ∪ Linkset2 and Lens2 = Linkset3 ∪ Linkset4 to align ETER with GRID and the LeidenRanking. This is also captured by the solid arrows in Figure 3.

View and Final Integration Lens. As explained above, constructing a View is the final step for expressing the user’s context-specific integration perspective over the data (see Figure 3). The system allows the construction of complex combinations of lenses. Listing 9 shows the specification for generating the final View required for use case 3, where Lens3 represents the final contextual integration:

Lens3 = Linkset5 ∩ Linkset6 ∩ Lens1 ∩ Lens2
      = Linkset5 ∩ Linkset6 ∩ (Linkset1 ∪ Linkset2) ∩ (Linkset3 ∪ Linkset4)

Figure 3: Integration Model of Case Study 3

13 GADM: http://www.gadm.org

Lens3 specifies different filters that refine the conditions on which a correspondence is to be included. For example, CrF1 is a filter that includes only those correspondences from Lens1 that have STRENGTH=2, that have a THRESHOLD greater or equal to 8.5, and that have been evaluated as ACCEPTED. Listing 9 also shows a list of the 15 properties (for Ψ) that are used to construct the View table; they are the ‘variables’ for which the data is studied.

FinalIntegrationLens
    Lens3 = Linkset5 ∩ Linkset6 ∩ Lens1 ∩ Lens2

CorrespondenceFilter
    CrF1 = Lens1 ← FILTER (STRENGTH = 2, THRESHOLD ⩾ 8.5, ACCEPT)
    CrF2 = Lens3 ← FILTER (STRENGTH = 2, THRESHOLD ⩾ 8.5)

PropertySelection (Ψ)
    ETER GRID GRID_ENR ETER_ENR Leiden ...
    -----------------------------------------------------------------
    eng_Inst_Name name level level University
    total_expenditure_EURO country typeCount total Country
    Number of R&D orgs city Field
    PP_top_10 types

Listing 9: Specifications for generating a final lens and view

The outcome of this view is a table that is readily usable by a researcher to answer a research question. In this case, the extent to which the variables indeed predict the ranking remains to be answered, but the correlation between the number of R&D organizations and the Leiden Ranking performance score (with the Dutch universities only) is 0.58. The strong correlation indeed suggests that further analysis is useful, as it indicates potential causality.¹⁴

This shows that realistic alignment processes such as those in case study 3 require complex combinations of operations over individual separate alignments, requiring the additional functionality of our Lenticular Lens system over other systems in the literature, and justifying the rich meta-data model over alignments that we introduced above.

5.3 Complexity

Time complexity. Given that creating a linkset requires two datasets X and Y, its time complexity depends on the alignment algorithm used and the sizes of the datasets. For example, using a naive string-based matching algorithm, each entry of the source dataset is compared to all those of the target dataset, which takes O(|X||Y|) time. Instead, using a blocking solution (implemented in the framework) takes O(|X||Z|), where Z ⊂ Y with |Z| ≪ |Y|. Furthermore, the cost for decorating the generated correspondences is low: an insertion of cost O(1) for each correspondence. Other performance parameters are strongly triple-store dependent. Computing an exact similarity between resources is fully implemented in SPARQL, and in practice takes less than a minute to create a linkset between GRID (74523 instances) and OrgRef (32010 instances) on commodity hardware.

14 Techreport about this study is available at http://sms.risis.eu/usecase_universities_environemnt_relation_performance
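The effect of blocking on the number of comparisons can be made concrete with a small Python sketch. The paper does not specify which blocking strategy the framework implements, so the blocking key used here (the first character of the lower-cased name), the similarity measure and the sample data are assumptions chosen only to show how each source entry is compared against a subset Z of the target instead of all of Y.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def block_key(name):
    """Hypothetical blocking key: first character of the lower-cased name."""
    return name.strip().lower()[:1]


def match_with_blocking(source, target, threshold=0.85):
    """Compare each source entry only with target entries in the same block.

    Returns the links found and the number of comparisons performed.
    """
    blocks = defaultdict(list)
    for t_uri, t_name in target.items():
        blocks[block_key(t_name)].append((t_uri, t_name))

    links, comparisons = [], 0
    for s_uri, s_name in source.items():
        for t_uri, t_name in blocks.get(block_key(s_name), []):
            comparisons += 1
            score = SequenceMatcher(None, s_name.lower(), t_name.lower()).ratio()
            if score >= threshold:
                links.append((s_uri, t_uri))
    return links, comparisons


source = {"grid:1": "3M Company", "grid:2": "Acme"}
target = {"orgref:9": "3M Company", "orgref:7": "Philips", "orgref:5": "Apex"}
links, comparisons = match_with_blocking(source, target)
naive_comparisons = len(source) * len(target)
```

A cheap key such as this one trades recall for speed: two names for the same entity that start with different characters would never be compared.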

Space Complexity. The space complexity is determined by the number of matches found between the datasets (the size |L| of a linkset), with 0 < |L| < |X||Y|. However, in practice |L| ≪ max(|X|, |Y|). The decoration adds only a fixed number n of triples per correspondence, resulting in an output n times bigger than a non-decorated linkset.

6 CONCLUSION

This paper presents a system for constructing context-specific links between datasets: Lenticular Lenses. A context-specific approach is needed because whether or not resources in different datasets are to be considered indiscernible strongly depends on the purpose and the task for which the datasets are to be used. We do not propose any new disambiguation method. Instead, our system allows the use of existing alignment methods to construct a linkset of potential mappings while annotating these mappings with rich meta-data using the proposed meta-model. Such meta-data enables Linked Data users to combine a variety of matching tools to obtain multiple context-sensitive alignments through simple SPARQL queries over the correspondences and their annotations. Compared to owl:sameAs, the proposed context:sameAs provides generic/shared metadata for alignment reproducibility, and specific correspondence metadata for context-specific re-usability and validation.

In three different case studies from the social science discipline of science and technology studies, we showed that our approach provides the necessary functionality to fulfil the requirements of complex realistic alignment problems: we maintain a rich meta-model that allows the user to select candidate alignments on a variety of properties such as tool of origin, alignment strength, user-approval, types of resources, properties used, etc.; the declarative representation of linksets and lenses allows the construction of new lenses by re-using previously constructed alignments (re-usability) and also facilitates reproducibility (shared metadata); we allow complex operations over alignments: not only the union operator from existing systems, but also intersection, difference and composition; we maintain a full declarative provenance trail that records which correspondences in a lens originate from which linkset, and in each linkset by which tool they were generated, among others; we support a mixed-initiative approach, where candidate correspondences generated by algorithms can be vetted or rejected through user-intervention, while rejected correspondences remain part of the linkset, albeit labelled as rejected in order to maintain a full provenance trail and re-usability. Our three realistic case studies from social e-science have shown that all of these features are indeed required to develop an environment for the construction and manipulation of context-sensitive links between datasets.

As singleton properties may not be easily used, the system allows for converting the enriched alignments into the usual flat format. It also allows for importing flat alignments generated by other tools. Although these alignments come without annotation, it is still possible to document the correspondences (e.g. validation). This can serve as a justification for selecting contextually valid correspondences. As future work, we plan to integrate state-of-the-art alignment algorithms such as [1] and automate the conversion of other reification models into singleton properties and vice versa. Another important feature to include is the detection of a chain of equality predicates across datasets and their respective contextual information. Furthermore, referenced resources evolve through time. This leaves us with the need to investigate the impact of this evolution on the network of correspondences in the Lenticular Lens system, including changes of statements due to new dataset versions. Additionally, we plan to enable the linking of datasets under construction to external datasets. Finally, besides deploying and validating a production version of the user interface, we plan to investigate an ontology for the Lenticular Lens (ll) vocabulary.

REFERENCES

[1] M. Al-Bakri, M. Atencia, J. David, S. Lalande, and M.-C. Rousset. Uncertainty-sensitive reasoning for inferring sameAs facts in linked data. In 22nd European Conference on Artificial Intelligence (ECAI), pages 698–706. IOS Press, 2016.

[2] K. Alexander, R. Cyganiak, M. Hausenblas, and J. Zhao. Describing Linked Datasets with the VoID Vocabulary. Technical report, 2011. W3C Note. URL: https://www.w3.org/TR/void/.

[3] C. Batchelor et al. Scientific lenses to support multiple views over linked chemistry data. In ISWC 2014, pages 98–113. Springer, 2014.

[4] W. Beek, S. Schlobach, and F. van Harmelen. A contextualised semantics for owl:sameAs. In International Semantic Web Conference, pages 405–419. Springer, 2016.

[5] C. Brenninkmeijer, C. Evelo, C. Goble, A. J. Gray, P. Groth, S. Pettifer, R. Stevens, A. J. Williams, and E. L. Willighagen. Scientific lenses over linked data: An approach to support task specific views of the data. A vision. In Proceedings of 2nd International Workshop on Linked Science, 2012.

[6] E. Daskalaki and D. Plexousakis. Oto matching system: a multi-strategy approach to instance matching. In CAiSE, pages 286–300. Springer, 2012.

[7] J. Euzenat and P. Shvaiko. Ontology matching. Springer-Verlag, Heidelberg (DE), 2nd edition, 2013.

[8] E. J. Hackett, O. Amsterdamska, M. Lynch, and J. Wajcman. The handbook of science and technology studies. The MIT Press, 2008.

[9] H. Halpin, P. J. Hayes, J. P. McCusker, D. L. McGuinness, and H. S. Thompson. When owl:sameAs isn’t the same: An analysis of identity in linked data. In The Semantic Web–ISWC 2010, pages 305–320. Springer, 2010.

[10] O. Hartig and B. Thompson. Foundations of an alternative approach to reification in RDF. arXiv preprint arXiv:1406.3399, 2014.

[11] O. Hassanzadeh, A. Kementsietsidis, L. Lim, R. J. Miller, and M. Wang. A framework for semantic link discovery over relational data. In 18th ACM Conference on Information and Knowledge Management, pages 1027–1036. ACM, 2009.

[12] O. Hassanzadeh, R. Xin, R. J. Miller, A. Kementsietsidis, L. Lim, and M. Wang. Linkage query writer. Proceedings of the VLDB Endowment, 2(2):1590–1593, 2009.

[13] D. Hernández, A. Hogan, C. Riveros, C. Rojas, and E. Zerega. Querying Wikidata: Comparing SPARQL, relational and graph databases. In ISWC, pages 88–103. Springer, 2016.

[14] E. Jiménez-Ruiz and B. C. Grau. LogMap: Logic-based and scalable ontology matching. In The Semantic Web–ISWC 2011, pages 273–288. Springer, 2011.

[15] V. Nguyen, O. Bodenreider, and A. Sheth. Don’t like RDF reification?: making statements about statements using singleton property. In Proceedings of the 23rd International Conference on World Wide Web, pages 759–770. ACM, 2014.

[16] R. Usbeck, A.-C. N. Ngomo, M. Röder, D. Gerber, S. A. Coelho, S. Auer, and A. Both. AGDISTIS-graph-based disambiguation of named entities using linked data. In The Semantic Web–ISWC 2014, pages 457–471. Springer, 2014.

[17] J. Van Ossenbruggen, M. Hildebrand, and V. De Boer. Interactive vocabulary alignment. In International Conference on Theory and Practice of Digital Libraries, pages 296–307. Springer, 2011.

[18] J. Volz, C. Bizer, M. Gaedke, and G. Kobilarov. Discovering and maintaining links on the web of data. In ISWC, pages 650–665. Springer, 2009.

[19] D. S. Wishart, C. Knox, A. Guo, D. Cheng, S. Shrivastava, D. Tzur, B. Gautam, and M. Hassanali. DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Research, 36(Database-Issue):901–906, 2008.
