
Berendt: Advanced databases, 2011, http://www.cs.kuleuven.be/~berendt/teaching


Advanced databases –

Core ideas of federated databases; Schema and ontology matching

Bettina Berendt

Katholieke Universiteit Leuven, Department of Computer Science

http://www.cs.kuleuven.ac.be/~berendt/teaching/2011-12-1stsemester/adb/

Last update: 18 October 2011


Until now ...

... we have looked into modelling

... we have seen how the languages RDF and OWL allow us to combine different schemas and data

... we have seen how Linked Data on the Web uses HTTP as a connecting protocol/architecture

... we have assumed that such combinations can be done effortlessly (unique names etc.)

... we have looked at some interpretation problems associated with these procedures

Now we need to ask: What are (further) challenges of such combinations?

What approaches have been proposed to solve them?

– from the databases & the Semantic Web / ontologies fields

– from architectural and logical points of view


Motivation 1: Price comparison engines search & combine heterogeneous travel-agency DBs, which search & combine heterogeneous airline DBs


Motivation 2: Schemas coming from different languages

A river is a natural stream of water, usually freshwater, flowing toward an ocean, a lake, or another stream. In some cases a river flows into the ground or dries up completely before reaching another body of water. Usually larger streams are called rivers while smaller streams are called creeks, brooks, rivulets, rills, and many other terms, but there is no general rule that defines what can be called a river. Sometimes a river is said to be larger than a creek,[1] but this is not always the case.[2]

Une rivière est un cours d'eau qui s'écoule sous l'effet de la gravité et qui se jette dans une autre rivière ou dans un fleuve, contrairement au fleuve qui se jette, lui, dans la mer ou dans l'océan.

[English: A rivière is a watercourse that flows under the effect of gravity and empties into another rivière or into a fleuve; a fleuve, by contrast, empties into the sea or the ocean.]

Een rivier is een min of meer natuurlijke waterstroom. We onderscheiden oceanische rivieren (in België ook wel stroom genoemd) die in een zee of oceaan uitmonden, en continentale rivieren die in een meer, een moeras of woestijn uitmonden. Een beek is de aanduiding voor een kleine rivier. Tussen beek en rivier ligt meestal een bijrivier.

[English: A rivier is a more or less natural stream of water. We distinguish oceanic rivers (in Belgium also called a stroom), which empty into a sea or ocean, and continental rivers, which empty into a lake, a marsh, or a desert. A beek is the term for a small river. Between beek and rivier (in size) usually lies a bijrivier (tributary).]


Motivation 3 (a): Are these the same entity?


Motivation 3 (b): „Who is that?“ – Merging identities

Mickey Mouse


Motivation 3 (c): „Who was that?“ – Re-identification


High-level overview: Goals and approaches in data integration

Basic goal: Combine data/knowledge from different sources

Goal / emphasis can lie on finding correspondences between

the models → schema matching, ontology matching

the instances → record linkage

Techniques can leverage similarities between

schema/ontology-level information

instance information

(the focus of most of today's session)

An established problem in DB; a focus and challenge for LOD (“owl:sameAs“)


Agenda

The match problem & what info to use for matching

(Semi-)automated matching: Example CUPID

(Semi-)automated matching: Example iMAP

Ontology matching

Evaluating matching

Core ideas of federated databases

Involving the user: Explanations; mass collaboration


Overview

goal: interoperability through data integration:

combining heterogeneous data sources under a single query interface

A federated database system is a type of meta-database management system (DBMS) which transparently integrates multiple autonomous database systems into a single federated database.

The constituent databases are interconnected via a computer network, and may be geographically decentralized.

Since the constituent database systems remain autonomous, a federated database system is a contrastable alternative to the (sometimes daunting) task of merging together several disparate databases.

A federated database (or virtual database) is the fully-integrated, logical composite of all constituent databases in a federated database system.


Issues in federating data sources

Interconnection and cooperation of autonomous and heterogeneous databases must address

Distribution

Autonomy

Heterogeneity


Architectures: Dealing differently with autonomy

Tightly coupled: global schema integration, e.g. data warehousing

More loosely coupled: federated databases with schema matching/mapping:

Global as View (GaV): the global schema is defined in terms of the underlying schemas

Local as View (LaV): the local schemas are defined in terms of the global schema
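As a minimal sketch of the GaV idea (all table and column names here are invented for illustration, not taken from the slides), the global schema can be defined as a view written over the source schemas:

```python
import sqlite3

# Hypothetical two-source setup: the source tables use different
# names and column conventions for the same kind of data.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE src_a_flights(code TEXT, price REAL);
CREATE TABLE src_b_flights(flight_no TEXT, fare REAL);
INSERT INTO src_a_flights VALUES ('LH123', 99.0);
INSERT INTO src_b_flights VALUES ('SN456', 120.0);

-- GaV: the global schema is a view defined in terms of the sources.
CREATE VIEW global_flights(flight, price) AS
  SELECT code, price FROM src_a_flights
  UNION ALL
  SELECT flight_no, fare FROM src_b_flights;
""")
rows = con.execute(
    "SELECT flight, price FROM global_flights ORDER BY price").fetchall()
print(rows)  # [('LH123', 99.0), ('SN456', 120.0)]
```

Under LaV the direction is reversed: each source would instead be described as a view over the global schema, and answering a query then requires rewriting it in terms of those views.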


Issues in query processing

In both GaV and LaV systems, a user poses conjunctive queries over a virtual schema represented by a set of views, or "materialized" conjunctive queries.

Integration seeks to rewrite the queries represented by the views so that their results are equivalent to, or maximally contained in, the results of the user's query.

This corresponds to the problem of answering queries using views.


An example developed in-house: SQI - PLQL

Purpose: For federated search in learning-object repositories

An approach with conceptual-level abstraction from data sources

Integratable data source types:

Relational, XML, IR systems, (search engine) Web services, search APIs

Full abstraction of user from data sources:

Yes

User-specific data source selection for integration:

Depends on application

User-specific data modeling for integration:

No

Explicit, queryable semantics:

(delegated to the sources: LOM etc.)


Heterogeneity

Heterogeneity is independent of location of data

When is an information system homogeneous?

Software that creates and manipulates data is the same

All data follows same structure and data model and is part of a single universe of discourse

Different levels of heterogeneity:

Different languages to write applications
Different query languages
Different models
Different DBMSs
Different file systems
Semantic heterogeneity
etc.


Agenda

The match problem & what info to use for matching

(Semi-)automated matching: Example CUPID

(Semi-)automated matching: Example iMAP

Ontology matching

Evaluating matching

Core ideas of federated databases

Involving the user: Explanations; mass collaboration


The match problem (Running example 1)

Given two schemas S1 and S2, find a mapping between elements of S1 and S2 that correspond semantically to each other


Running example 2


Motivation: application areas

Schema integration in multi-database systems

Data integration systems on the Web

Translating data (e.g., for data warehousing)

E-commerce message translation

P2P data management

Model management (tools for easily manipulating models of data)


Based on what information can the matchings/mappings be found?

(work on the two running examples)


The match operator

Match operator: f(S1,S2) = mapping between S1 and S2

for schemas S1, S2

Mapping

a set of mapping elements

Mapping elements

elements of S1, elements of S2, mapping expression

Mapping expression

different functions and relationships


Matching expressions: examples

Scalar relations (=, ≥, ...):
S.HOUSES.location = T.LISTINGS.area

Functions:
T.LISTINGS.list-price = S.HOUSES.price * (1 + S.AGENTS.fee-rate)
T.LISTINGS.agent-address = concat(S.AGENTS.city, S.AGENTS.state)

ER-style relationships (is-a, part-of, ...)
Set-oriented relationships (overlaps, contains, ...)
Any other terms that are defined in the expression language used


Matching and mapping

1. Find the schema match („declarative“)

2. Create a procedure (e.g., a query expression) to enable automated data translation or exchange (mapping, „procedural“)

Example of result of step 2: To create T.LISTINGS from S (simplified notation):

area = SELECT location FROM HOUSES

agent-name = SELECT name FROM AGENTS

agent-address = SELECT concat(city,state) FROM AGENTS

list-price = SELECT price * (1+fee-rate)

FROM HOUSES, AGENTS

WHERE agent-id = id
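The four attribute mappings above can be combined into a single executable query. A sketch using Python's sqlite3, with invented sample rows and SQLite's || operator standing in for concat():

```python
import sqlite3

# Simplified S-side schema; column names follow the slide's example,
# with underscores instead of hyphens for SQL compatibility.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE HOUSES(location TEXT, price REAL, agent_id INTEGER);
CREATE TABLE AGENTS(id INTEGER, name TEXT, city TEXT, state TEXT, fee_rate REAL);
INSERT INTO HOUSES VALUES ('Leuven centre', 200000, 1);
INSERT INTO AGENTS VALUES (1, 'An', 'Leuven', 'VB', 0.05);
""")
# One query materializes T.LISTINGS from S.
listings = con.execute("""
  SELECT h.location                   AS area,
         a.name                       AS agent_name,
         a.city || ', ' || a.state    AS agent_address,
         h.price * (1 + a.fee_rate)   AS list_price
  FROM HOUSES h JOIN AGENTS a ON h.agent_id = a.id
""").fetchall()
print(listings)  # one integrated row: (area, agent_name, agent_address, list_price)
```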


Based on what information can the matchings/mappings be found?

Rahm & Bernstein‘s classification of schema matching approaches


Challenges

Semantics of the involved elements often need to be inferred

Often need to base (heuristic) solutions on cues in schema and data, which are unreliable

e.g., homonyms (area), synonyms (area, location)

Schema and data clues are often incomplete (e.g., date: date of what?)

Global nature of matching: to choose one matching possibility, must typically exclude all others as worse

Matching is often subjective and/or context-dependent (e.g., does house-style match house-description or not?)

Extremely laborious and error-prone process; e.g., Li & Clifton (2000), a project at GTE telecommunications:

40 databases, 27K elements, no access to the original developers of the DBs; estimated time for just finding and documenting the matches: 12 person-years


Semi-automated schema matching (1)

Rule-based solutions

Hand-crafted rules

Exploit schema information

+ relatively inexpensive

+ do not require training

+ fast (operate only on schema, not data)

+ can work very well in certain types of applications & domains

+ rules can provide a quick & concise method of capturing user knowledge about the domain

– cannot exploit data instances effectively

– cannot exploit previous matching efforts

(other than by re-use)


Semi-automated schema matching (2)

Learning-based solutions

Rules/mappings learned from attribute specifications and statistics of data content (Rahm & Bernstein: „instance-level matching“)

Exploit schema information and data

Some approaches use external evidence:

Past matches

Corpus of schemas and matches („matchings in real-estate applications will tend to be alike“)

Corpus of users (more details later in this slide set)

+ can exploit data instances effectively

+ can exploit previous matching efforts

– relatively expensive

– require training

– slower (operate on data, not only on the schema)

– results may be opaque (e.g., neural network output) → explanation components! (more details later)


Agenda

The match problem & what info to use for matching

(Semi-)automated matching: Example CUPID

(Semi-)automated matching: Example iMAP

Ontology matching

Evaluating matching

Core ideas of federated databases

Involving the user: Explanations; mass collaboration


Overview (1)

Rule-based approach

Schema types:

Relational, XML

Metadata representation:

Extended ER

Match granularity:

Element, structure

Match cardinality:

1:1, n:1


Overview (2)

Schema-level match:

Name-based: name equality, synonyms, hypernyms, homonyms, abbreviations

Constraint-based: data type and domain compatibility, referential constraints

Structure matching: matching subtrees, weighted by leaves

Re-use, auxiliary information used:

Thesauri, glossaries

Combination of matchers:

Hybrid

Manual work / user input:

User can adjust threshold weights


Basic representation: Schema trees

Computation overview:

1. Compute similarity coefficients between elements of these graphs

2. Deduce a mapping from these coefficients


Computing similarity coefficients (1): Linguistic matching

Operates on schema element names (= nodes in schema tree)

1. Normalization

Tokenization (parse names into tokens based on punctuation, case, etc.), e.g., Product_ID → {Product, ID}

Expansion (of abbreviations and acronyms)

Elimination (of prepositions, articles, etc.)

2. Categorization / clustering

Based on data types, schema hierarchy, and linguistic content of names, e.g., „real-valued elements“, „money-related elements“

3. Comparison (within the categories)

Compute linguistic similarity coefficients (lsim) based on a thesaurus (synonymy, hypernymy)

Output: Table of lsim coefficients (in [0,1]) between schema elements
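The normalization and comparison steps can be sketched as follows; the abbreviation table, the synonym sets, and the token-overlap scoring are illustrative stand-ins for CUPID's thesaurus-based coefficients:

```python
import re

# Illustrative auxiliary knowledge (assumed, not from the slides).
ABBREV = {"id": "identifier", "no": "number"}
SYNONYMS = [{"location", "area"}, {"price", "cost"}]

def tokenize(name):
    # Split on punctuation/underscores and camelCase boundaries,
    # lowercase, and expand known abbreviations.
    parts = re.split(r"[_\-\s]+|(?<=[a-z])(?=[A-Z])", name)
    return [ABBREV.get(p.lower(), p.lower()) for p in parts if p]

def lsim(name1, name2):
    # Fraction of tokens that match directly or via a synonym set.
    t1, t2 = set(tokenize(name1)), set(tokenize(name2))
    hits = 0
    for a in t1:
        for b in t2:
            if a == b or any(a in s and b in s for s in SYNONYMS):
                hits += 1
    return hits / max(len(t1), len(t2))

print(tokenize("Product_ID"))   # ['product', 'identifier']
print(lsim("location", "area")) # 1.0
```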


How to identify synonyms and homonyms: Example WordNet


How to identify hypernyms: Example WordNet


Computing similarity coefficients (2): Structure matching

Intuitions:

Leaves are similar if they are linguistic & data-type similar, and if they have similar neighbourhoods

Non-leaf elements are similar if linguistically similar & have similar subtrees (where leaf sets are most important)

Procedure:

1. Initialize structural similarity of leaves based on data types

Identical data types: compat. = 0.5; otherwise in [0,0.5]

2. Process the tree in post-order

3. Stronglink(leaf1, leaf2) iff their weighted sim. ≥ threshold

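The leaf-linking test can be sketched as a weighted mean of structural and linguistic similarity; the weight and threshold values below are illustrative assumptions, not CUPID's actual defaults:

```python
# Assumed tuning constants (illustrative only).
W_STRUCT, TH_ACCEPT = 0.5, 0.6

def wsim(ssim, lsim):
    # Weighted mean of structural and linguistic similarity.
    return W_STRUCT * ssim + (1 - W_STRUCT) * lsim

def strong_link(ssim, lsim):
    # Stronglink(leaf1, leaf2) iff the weighted similarity
    # reaches the acceptance threshold (step 3 above).
    return wsim(ssim, lsim) >= TH_ACCEPT

# Identical data types give initial ssim = 0.5 (step 1 above).
print(strong_link(0.5, 0.9))  # True  (wsim ≈ 0.7)
print(strong_link(0.2, 0.4))  # False (wsim ≈ 0.3)
```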


The structure matching algorithm

Output: a 1:n mapping for leaves

To generate non-leaf mappings: 2nd post-order traversal


Matching shared types

Solution: expand the schema into a schema tree, then proceed as before

Can help to generate context-dependent mappings

Fails if a cycle of containment and IsDerivedFrom relationships is present (e.g., recursive type definitions)


Agenda

The match problem & what info to use for matching

(Semi-)automated matching: Example CUPID

(Semi-)automated matching: Example iMAP

Ontology matching

Evaluating matching

Core ideas of federated databases

Involving the user: Explanations; mass collaboration


Main ideas

A learning-based approach

Main goal: discover complex matches

In particular: functions such as

T.LISTINGS.list-price = S.HOUSES.price * (1+S.AGENTS.fee-rate)

T.LISTINGS.agent-address = concat(S.AGENTS.city,S.AGENTS.state)

Works on relational schemas

Basic idea: reformulate schema matching as search


Architecture

Specialized searchers are specialized in discovering certain types of complex matches → this makes the search more efficient


Overview of implemented searchers


Example: The textual searcher

For target attribute T.LISTINGS.agent-address:

Examine attributes and concatenations of attributes from S

Restrict the examined set by analyzing textual properties

Data type information in schema, heuristics (proportion of non-numeric characters etc.)

Evaluate match candidates based on data correspondences, prune inferior candidates


Example: The numerical searcher

For target attribute T.LISTINGS.list-price:

Examine attributes and arithmetic expressions over them from S

Restrict examined set by analyzing numeric properties

Data type information in schema, heuristics

Evaluate match candidates based on data correspondences, prune inferior candidates


Search strategy (1): Example textual searcher

1. Learn a (Naive Bayes) classifier

text → class („agent-address“ or „other“)

from the data instances in T.LISTINGS.agent-address

2. Apply this classifier to each match candidate (e.g., location, concat(city,state))

3. Score of the candidate = average over instance probabilities

4. For expansion: beam search – keep only the k top-scoring candidates
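The scoring-and-pruning loop can be sketched as follows; the score table stands in for the Naive Bayes classifier (a lookup of precomputed average instance probabilities), and beam_step keeps only the k best candidates for expansion:

```python
def avg_prob(probs):
    # Candidate score = average classifier probability over its
    # data instances (step 3 above).
    return sum(probs) / len(probs)

def beam_step(candidates, score_fn, k):
    # Keep only the k top-scoring candidates (step 4 above).
    scored = [(score_fn(c), c) for c in candidates]
    scored.sort(reverse=True)
    return [c for _, c in scored[:k]]

# Invented scores standing in for a trained classifier's output.
scores = {"location": 0.2, "concat(city,state)": 0.9, "name": 0.4}
best = beam_step(list(scores), scores.get, k=2)
print(best)  # ['concat(city,state)', 'name']
```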


Search strategy (2): Example numeric searcher

1. Get value distributions of target attribute and each candidate

2. Compare the value distributions (Kullback-Leibler divergence measure)

3. Score of the candidate = Kullback-Leibler measure
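Step 2 can be sketched as a plain Kullback-Leibler computation over binned value distributions (the bins and probabilities are invented for illustration); a lower divergence indicates a better candidate:

```python
import math

def kl_divergence(p, q):
    # KL(p || q) over two discrete distributions on the same bins.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

target    = [0.1, 0.6, 0.3]  # value distribution of the target attribute
candidate = [0.1, 0.5, 0.4]  # a close candidate expression
far_off   = [0.8, 0.1, 0.1]  # a poor candidate

# Lower divergence = better match candidate.
print(kl_divergence(target, candidate) < kl_divergence(target, far_off))  # True
```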


Evaluation strategies of implemented searchers


Pruning by domain constraints

Multiple attributes of S: „attributes name and beds are unrelated“ → do not generate match candidates with these 2 attributes

Properties of a single attribute of T: „the average value of num-rooms does not exceed 10“ → use in the evaluation of candidates

Properties of multiple attributes of T: „lot-area and num-baths are unrelated“ → „clean up“ at the match selector level:

Example

– T.num_baths = S.baths

– T.lot-area = (S.lot-sq-feet / 43560) + 1.3e-15 * S.baths (?)

Based on the domain constraint, drop the term involving S.baths


Pruning by using knowledge from overlap data

When S and T share the same data

Consider fraction of data for which mapping is correct

e.g., house locations:

S.HOUSES.location overlaps more with T.LISTINGS.area than with T.LISTINGS.agent-address

Discard the candidate T.LISTINGS.agent-address = S.HOUSES.location,

keep only T.LISTINGS.agent-address = concat(S.AGENTS.city, S.AGENTS.state)


Agenda

The match problem & what info to use for matching

(Semi-)automated matching: Example CUPID

(Semi-)automated matching: Example iMAP

Ontology matching

Evaluating matching

Core ideas of federated databases

Involving the user: Explanations; mass collaboration


What is ontology matching (relative to schema matching)?

same basic idea

but works on ontologies that are conceptual models (not on logical schemas such as relational tables or XML trees)

emphasizes that concepts and relations need to be matched and mapped, and may treat these differently

(Note: in the schema matching literature, it is not always clearly laid out whether the matched items come from a conceptual or a logical model; the toy examples above in particular are also conceptual)

In practice, some ontology matching tasks in fact work on such simple models (or simple subparts of models) that they do not differ at all from what we have seen so far

example: Anatomy task, see below in evaluation

Terminology: also known as ontology alignment. See (Shvaiko & Euzenat, 2005) for more details.


Recap: Rahm & Bernstein‘s classification of schema matching approaches


The methods that are important when the schema is in the foreground (which it is in ontologies!)


The extension by Shvaiko & Euzenat (2005) [Partial view]


A classification of approaches

See above


One area in which ontology alignment becomes particularly interesting: natural language and cross-lingual integration

(because this shows very nicely how concepts do not always align neatly)


[Diagram: the water ontology, from the Costello & Jacobs OWL tutorial: http://www.racai.ro/EUROLAN-2003/html/presentations/JamesHendler/owl/OWL.ppt. Classes: NaturallyOccurringWaterSource, BodyOfWater, Stream, River, Sea, Ocean, Lake, Tributary, Brook, Rivulet. Property labels in the figure: feedsFrom: River, emptiesInto: BodyOfWater, containedIn: BodyOfWater (Transitive), connectsTo: NaturallyOccurringWaterSource (Symmetric); (Functional), (Inverse Functional), and (Inverse) annotations are attached to the property arrows.]


How could this give rise to a mapping/matching problem?

A river is a natural stream of water, usually freshwater, flowing toward an ocean, a lake, or another stream. In some cases a river flows into the ground or dries up completely before reaching another body of water. Usually larger streams are called rivers while smaller streams are called creeks, brooks, rivulets, rills, and many other terms, but there is no general rule that defines what can be called a river. Sometimes a river is said to be larger than a creek,[1] but this is not always the case.[2]

Une rivière est un cours d'eau qui s'écoule sous l'effet de la gravité et qui se jette dans une autre rivière ou dans un fleuve, contrairement au fleuve qui se jette, lui, dans la mer ou dans l'océan.

[English: A rivière is a watercourse that flows under the effect of gravity and empties into another rivière or into a fleuve; a fleuve, by contrast, empties into the sea or the ocean.]

Een rivier is een min of meer natuurlijke waterstroom. We onderscheiden oceanische rivieren (in België ook wel stroom genoemd) die in een zee of oceaan uitmonden, en continentale rivieren die in een meer, een moeras of woestijn uitmonden. Een beek is de aanduiding voor een kleine rivier. Tussen beek en rivier ligt meestal een bijrivier.

[English: A rivier is a more or less natural stream of water. We distinguish oceanic rivers (in Belgium also called a stroom), which empty into a sea or ocean, and continental rivers, which empty into a lake, a marsh, or a desert. A beek is the term for a small river. Between beek and rivier (in size) usually lies a bijrivier (tributary).]


Sometimes a class needs to restrict the range of a property

[Diagram: the water ontology again, extended with Fleuve as a subclass of River; the property emptiesInto: BodyOfWater is attached to River.]

Since Fleuve is a subclass of River, it inherits emptiesInto. The range for emptiesInto is any BodyOfWater. However, the definition of a Fleuve (French) is: "a River which emptiesInto a Sea". Thus, in the context of the Fleuve class we want the range of emptiesInto restricted to Sea.


Pretty standard: A class and an object property in OWL

<owl:ObjectProperty rdf:ID="emptiesInto">
  <rdfs:domain rdf:resource="#River"/>
  <rdfs:range rdf:resource="#BodyOfWater"/>
</owl:ObjectProperty>

<owl:Class rdf:ID="River">
  <rdfs:subClassOf rdf:resource="#Stream"/>
</owl:Class>

Note for nerds: Why does this use „rdf:ID“ and not „rdf:about“ (as FOAF does)?

“As for choosing between rdf:ID and rdf:about, you will most likely want to use the former if you are describing a resource that doesn't really have a meaningful location outside the RDF file that describes it. Perhaps it is a local or convenience record, or even a proxy for an abstraction or real-world object (although I recommend you take great care describing such things in RDF as it leads to all sorts of metaphysical confusion; I have a practice of only using RDF to describe records that are meaningful to a computer). rdf:about is usually the way to go when you are referring to a resource with a globally well-known identifier or location.“ (http://www.ibm.com/developerworks/xml/library/x-tiprdfai.html)


Global vs Local Properties

rdfs:range imposes a global restriction on the emptiesInto property, i.e., the rdfs:range value applies to River and all subclasses of River.

As we have seen, in the context of the Fleuve class, we would like the emptiesInto property to have its range restricted to just the Sea class. Thus, for the Fleuve class we want a local definition of emptiesInto.


Defining emptiesInto (when used in Fleuve) to have allValuesFrom the Sea class

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xml:base="http://www.geodesy.org/water/naturally-occurring">

<owl:Class rdf:ID="Fleuve">
  <rdfs:subClassOf rdf:resource="#River"/>
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="#emptiesInto"/>
      <owl:allValuesFrom rdf:resource="#Sea"/>
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>

...

</rdf:RDF>

naturally-occurring.owl (snippet): one way of specifying matching expressions in OWL ... here: by model extension


Older brother

Younger brother

Older sister

Younger sister

What about this? Different languages have different (lexicalized) concept boundaries


Agenda

The match problem & what info to use for matching

(Semi-)automated matching: Example CUPID

(Semi-)automated matching: Example iMAP

Ontology matching

Evaluating matching

Core ideas of federated databases

Involving the user: Explanations; mass collaboration


How to compare?

Input: What kind of input data? (What languages? Only toy examples? What external information?)

Output: mapping between attributes or tables, nodes or paths? How much information does the system report?

Quality measures: metrics for accuracy and completeness?

Effort: how much savings of manual effort, how quantified?

Pre-match effort (training of learners, dictionary preparation, ...)

Post-match effort (correction and improvement of the match output)

How are these measured?


Match quality measures

Need a „gold standard“ (the „true“ match)

Measures from information retrieval: precision, recall, F-measure

(standard choice: F1, i.e., α = 0.5)

Quantifies post-match effort
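Treating both the produced mapping and the gold standard as sets of correspondence pairs, precision, recall, and F1 can be computed as in this sketch (the example pairs are invented):

```python
def match_quality(predicted, gold):
    # predicted, gold: sets of (source element, target element) pairs.
    tp = len(predicted & gold)  # true positives: correct correspondences found
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("location", "area"), ("price", "list-price"), ("name", "agent-name")}
pred = {("location", "area"), ("price", "list-price"), ("name", "comments")}
print(tuple(round(x, 3) for x in match_quality(pred, gold)))  # (0.667, 0.667, 0.667)
```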


Benchmarking

Do, Melnik, and Rahm (2003) found that evaluation studies were not comparable

Need more standardized conditions (benchmarks)

Now a tradition of competitions in ontology matching (more in the next session):

Test cases and contests at http://www.ontologymatching.org/evaluation.html


Example: Tasks 2009 (various are re-used; 2011 is currently running) (excerpt from http://oaei.ontologymatching.org/2009/)

Expressive ontologies: anatomy

The anatomy real world case is about matching the Adult Mouse Anatomy (2744 classes) and the NCI Thesaurus (3304 classes) describing the human anatomy.

conference: Participants will be asked to find all correct correspondences (equivalence and/or subsumption correspondences) and/or 'interesting correspondences' within a collection of ontologies describing the domain of organising conferences (a domain well understandable for every researcher). Results will be evaluated a posteriori, in part manually and in part by data-mining and logical-reasoning techniques. There will also be an evaluation against a reference mapping based on a subset of the whole collection.

Directories and thesauri: fishery gears

Features four different classification schemes, expressed in OWL, adopted by different fishery information systems in the FIM division of FAO. An alignment performed on these 4 schemes should be able to spot equivalence, or a degree of similarity, between the fishing gear types and the groups of gears, so as to enable a future exercise of data aggregation across systems.

Oriented matching: This track focuses on the evaluation of alignments that contain mapping relations other than equivalences.

Instance matching: very large crosslingual resources

The purpose of this task (vlcr) is to match the Thesaurus of the Netherlands Institute for Sound and Vision (called GTAA, see below for more information) to two other resources: the English WordNet from Princeton University and DBpedia.


Mice and humans

The anatomy real world case is about matching the Adult Mouse Anatomy (2744 classes) and the NCI Thesaurus (3304 classes) describing the human anatomy.

(http://oaei.ontologymatching.org/2008/anatomy/)


Matching task and evaluation approach (http://oaei.ontologymatching.org/2007/anatomy/)

We would like to gratefully thank Martin Ringwald and Terry Hayamizu (Mouse Genome Informatics - http://www.informatics.jax.org/), who provided us with a reference mapping for these ontologies.

The reference mapping contains only equivalence correspondences between concepts of the ontologies. No correspondences between properties (roles) are specified.

If your system also creates correspondences between properties or correspondences that describe subsumption relations, these results will not influence the evaluation (but can nevertheless be part of your submitted results).

The results of your matching system will be compared to this reference alignment. Therefore, all of the results have to be delivered in the format specified here.


(Some) results (http://oaei.ontologymatching.org/2009/results/anatomy/)


Agenda

The match problem & what info to use for matching

(Semi-)automated matching: Example CUPID

(Semi-)automated matching: Example iMAP

Ontology matching

Evaluating matching

Core ideas of federated databases

Involving the user: Explanations; mass collaboration


Example in iMAP

User sees ranked candidates:

1. list-price = price

2. list-price = price * (1 + fee-rate)

Explanation:

a) Both were generated by the numeric searcher; 2 was ranked higher than 1.

b) But:

c) month-posted matches fee-rate.

d) Domain constraint: matches for month-posted and price do not share attributes.

e) Hence list-price cannot be matched to anything involving fee-rate.

f) Why c)?

g) Data instances of fee-rate were classified as being of type date.

The user corrects this wrong step g); the rest is repaired accordingly.


Background knowledge structure for explanation: dependency graph
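A minimal sketch of such a dependency graph, with invented step names loosely following the list-price example: each decision records the steps it depends on, so retracting one wrong step invalidates everything downstream, which is how a correction can be propagated automatically.

```python
# Dependency graph sketch (hypothetical step names, not iMAP's internal
# representation): decision -> list of steps it depends on.
DEPENDS_ON = {
    "fee-rate is of type date": [],
    "month-posted = fee-rate": ["fee-rate is of type date"],
    "list-price cannot involve fee-rate": ["month-posted = fee-rate"],
    "rank list-price = price first": ["list-price cannot involve fee-rate"],
}

def invalidated_by(retracted: str) -> set:
    """All decisions that (transitively) depend on the retracted step."""
    out = set()
    frontier = [retracted]
    while frontier:
        step = frontier.pop()
        for decision, deps in DEPENDS_ON.items():
            if step in deps and decision not in out:
                out.add(decision)
                frontier.append(decision)
    return out

# Correcting the wrong type classification retracts the whole chain:
stale = invalidated_by("fee-rate is of type date")
```

After the stale decisions are retracted, the matcher can re-run only the affected steps instead of restarting from scratch.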


MOBS: Using mass collaboration to automate data integration

1. Initialization: a correct but partial match (e.g. title = a1, title = b2, etc.)

2. Soliciting user feedback: when a user issues a query, the user must answer a simple question before receiving the answer to the initial query.

3. Computing user weights (e.g., trustworthiness = fraction of correct answers on known mappings)

4. Combining user feedback (e.g., majority count). Important: "instant gratification" (e.g., include the new field in the results page after a user has given helpful input).
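Steps 3 and 4 can be sketched as a trustworthiness-weighted vote. The numbers below are invented, and this is only an illustration of the idea, not the MOBS implementation:

```python
# Sketch of MOBS-style feedback combination (hypothetical data):
# each user's vote on a candidate mapping is weighted by trustworthiness,
# estimated as the fraction of correct answers on known mappings.
def trustworthiness(correct: int, total: int) -> float:
    return correct / total if total else 0.5  # unknown users start neutral

def combine(votes):
    """votes: list of (says_match, user_weight) pairs; accept the mapping
    if the weighted 'yes' mass exceeds the weighted 'no' mass."""
    yes = sum(w for says_match, w in votes if says_match)
    no = sum(w for says_match, w in votes if not says_match)
    return yes > no

votes = [(True, trustworthiness(9, 10)),   # reliable user says "match"
         (True, trustworthiness(8, 10)),
         (False, trustworthiness(2, 10))]  # unreliable user disagrees
accepted = combine(votes)                  # the weighted majority wins
```

Weighting by trustworthiness makes the scheme robust to careless or malicious answers, since such users quickly lose influence.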


Some issues of matching – (not only) when it comes to individuals

"Is this the same entity?":

What does "the same" mean anyway?

(When) do we want these inferences?


Outlook

The match problem & what info to use for matching

(Semi-)automated matching: Example CUPID

(Semi-)automated matching: Example iMAP

Ontology matching

Evaluating matching

Core ideas of federated databases

Involving the user: Explanations; mass collaboration

KDD (1): Visualizations for exploratory data analysis


References / background reading; acknowledgements

Rahm, E. & Bernstein, P.A. (2001). A survey of approaches to automatic schema matching. The VLDB Journal, 10, 334-350.

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.700

Doan, A. & Halevy, A.Y. (2004). Semantic Integration Research in the Database Community: A brief survey. AI Magazine.

http://dit.unitn.it/~p2p/RelatedWork/Matching/si-survey-db-community.pdf

Madhavan, J., Bernstein, P.A., & Rahm, E. (2001). Generic Schema Matching with Cupid. In Proc. of the 27th VLDB Conference.

http://dbs.uni-leipzig.de/de/publication/title/generic_schema_matching_with_cupid

Dhamankar, R., Lee, Y., Doan, A., Halevy, A., & Domingos, P. (2004). iMAP: Discovering complex semantic matches between database schemas. In Proc. of SIGMOD 2004.

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.5.4117

Shvaiko, P. & Euzenat, J. (2005). A Survey of Schema-based Matching Approaches. Journal on Data Semantics.

http://www.dit.unitn.it/~p2p/RelatedWork/Matching/JoDS-IV-2005_SurveyMatching-SE.pdf

Also interesting: Noy, N. (2004). Semantic Integration: A Survey of Ontology-based Approaches. SIGMOD Record, 33(3). http://www.dit.unitn.it/~p2p/RelatedWork/Matching/13.natasha-10.pdf

Do, H.-H., Melnik, S., & Rahm, E. (2003). Comparison of schema matching evaluations. In Web, Web-Services, and Database Systems: NODe 2002, Web- and Database-Related Workshops, Erfurt, Germany, October 7-10, 2002. Revised Papers (pp. 221-237). Springer.

http://dit.unitn.it/~p2p/RelatedWork/Comparison%20of%20Schema%20Matching%20Evaluations.pdf

McCann, R., Doan, A., Varadarajan, V., & Kramnik, A. (2003). Building data integration systems via mass collaboration. In Proc. International Workshop on the Web and Databases (WebDB).

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.4.9964

Please see the PowerPoint slide-specific "notes" for the URLs of the pictures and formulae used.