integrating schema integration frameworks,...

21
Integrating schema integration frameworks, algebraically * Technical Report CSRG-583 Department of Computer Science, University of Toronto, 2008 Zinovy Diskin Steve Easterbrook Ren´ ee Miller {zdiskin, sme, miller}@cs.toronto.edu Abstract The paper presents a framework, in which the main concepts of schema and data integration can be spec- ified both semantically and syntactically in an ab- stract data-model independent way. We first define what are schema matching and integration seman- tically, in terms of sets of instances and mappings between them. We also define a schema matching and integration syntactically, and introduce a pro- cedure (the how) for computing the integration of matched schema according to the syntactical defini- tion in fairly abstract terms. The main theorem of the paper states that the result of syntactical inte- gration satisfies the semantic definition and, hence, does produce what we really need. We show how this framework unifies the approach taken in at least five other schema integration proposals and fills in some important gaps in these proposals. Viewed in a broader perspective, the framework developed in the paper integrates the syntactical and semantical sides of model management and, particularly, reveals a re- * Supported by Bell Canada through the Bell University Labs, NSERC and Ontario Centres Of Excellence markable duality between them. The results of the paper can then be seen as an (important yet) partic- ular case of this general duality theory. Contents 1 Introduction 2 1.1 The problems .............. 2 1.2 The causes ............... 3 1.3 The approach .............. 3 1.4 Some highlights ............. 4 2 Background. 5 2.1 Manual vs. semi-automatic (heuristics vs. algebra) ......... 5 2.2 How to specify and manage structural conflicts ................. 7 2.3 Semantic justification of syntactic pro- cedures ................. 7 2.4 Summary of Schema Integration Ap- proaches ................. 8 3 Sample scenario 8 3.1 Example of relational schema integration 8 1

Upload: others

Post on 15-Oct-2019

34 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Integrating schema integration frameworks, algebraicallyftp.cs.toronto.edu/pub/reports/csrg/583/TR-583-schemaIntegr.pdf · Integrating schema integration frameworks, algebraically∗

Integrating schema integrationframeworks, algebraically∗

Technical Report CSRG-583

Department of Computer Science,

University of Toronto, 2008

Zinovy Diskin Steve Easterbrook Renee Miller{zdiskin, sme, miller}@cs.toronto.edu

Abstract

The paper presents a framework, in which the mainconcepts of schema and data integration can be spec-ified both semantically and syntactically in an ab-stract data-model independent way. We first definewhat are schema matching and integration seman-tically, in terms of sets of instances and mappingsbetween them. We also define a schema matchingand integration syntactically, and introduce a pro-cedure (the how) for computing the integration ofmatched schema according to the syntactical defini-tion in fairly abstract terms. The main theorem ofthe paper states that the result of syntactical inte-gration satisfies the semantic definition and, hence,does produce what we really need. We show howthis framework unifies the approach taken in at leastfive other schema integration proposals and fills insome important gaps in these proposals. Viewed in abroader perspective, the framework developed in thepaper integrates the syntactical and semantical sidesof model management and, particularly, reveals a re-

∗Supported by Bell Canada through the Bell UniversityLabs, NSERC and Ontario Centres Of Excellence

markable duality between them. The results of thepaper can then be seen as an (important yet) partic-ular case of this general duality theory.

Contents

1 Introduction 21.1 The problems . . . . . . . . . . . . . . 21.2 The causes . . . . . . . . . . . . . . . 31.3 The approach . . . . . . . . . . . . . . 31.4 Some highlights . . . . . . . . . . . . . 4

2 Background. 52.1 Manual vs. semi-automatic

(heuristics vs. algebra) . . . . . . . . . 52.2 How to specify and manage structural

conflicts . . . . . . . . . . . . . . . . . 72.3 Semantic justification of syntactic pro-

cedures . . . . . . . . . . . . . . . . . 72.4 Summary of Schema Integration Ap-

proaches . . . . . . . . . . . . . . . . . 8

3 Sample scenario 83.1 Example of relational schema integration 8

1

Page 2: Integrating schema integration frameworks, algebraicallyftp.cs.toronto.edu/pub/reports/csrg/583/TR-583-schemaIntegr.pdf · Integrating schema integration frameworks, algebraically∗

3.2 Abstract arrangement . . . . . . . . . 113.3 Discussion . . . . . . . . . . . . . . . . 11

4 Universal data model 124.1 Example of data definition with dp-

graphs . . . . . . . . . . . . . . . . . . 124.2 Querying dp-graphs . . . . . . . . . . 134.3 Universality statement. . . . . . . . . . 134.4 FAQ about dp-graphs . . . . . . . . . 14

5 Schema matching as query discoveryand equating 145.1 Vanilla’s constructs via dp-graphs . . . 15

5.1.1 Preliminaries. . . . . . . . . . . 155.1.2 Synonymy . . . . . . . . . . . . 155.1.3 Homonymy . . . . . . . . . . . 16

5.2 More on conflicts . . . . . . . . . . . . 17

6 Conclusion 18

1 Introduction

Schema integration (lately, schema/model merge) isthe problem of building a global data schema from aset of local schemas (usually with overlapping seman-tics) to provide the user with a unified view of the en-tire dataset. The goal is to give the user of the globalschema an illusion of a single dataset specified by asingle schema. Schema integration appears in manymetadata management tasks, e.g., view integrationfor database design, building mediated schemas fordata integration and warehousing, merging ontolo-gies and several others. It is a well-known researchtopic with an enormous literature. Early works aresurveyed in [BLN86, SL90], and references to morerecent work can be found in [Len02, PB03]. Yet a suf-ficiently general, theoretically sound and practicallyapplicable solution still seems to be missing from theliterature.

1.1 The problems

Three inter-related aspects of the problem contributeto its complexity. The first is structural conflict be-tween local schemas. The same data can be viewedquite differently by different users and hence may be

poetry art

Book

title topic science

Book

Sci.

Art

Poetry

title

V1 V2

Figure 1: Structural conflict/relativism: two views ofthe same data

Book

title isbn auth-name

Book

title isbn

Author

name RBA

V1 V3

Author

name books

V2

m

n

Figure 2: Semantic relativism: three views on thesame data

specified differently in local schemas. For example,one schema may have an attribute fullName while theother has attributes f-name and l-name (first and lastnames respectively). Or, an enumeration-valued at-tribute in one schema might correspond to a subclass-ing hierarchy in another schema (Fig. 1). Or, whatis an entity for one user, may be a relationship foranother, and an attribute for the third (Fig. 2). Forintegration, these conflicts must be precisely speci-fied and then reconciled. Though many particularapproaches and languages for this purpose have beenproposed, a simple yet general definition of what is astructural conflict is missing from the literature.

The second source of complexity is extreme hetero-geneity of the modern metadata management. Localschemas are usually built in different data models(relational, XML, ER, UML...), and new domain-specific languages appearing on the stage continueto increase the diversity. A typical approach to theproblem is first to translate local schemas into someuniversal data model U , and then integrate themwithin it. A few questions immediately arise. IsU sufficiently expressive to accurately represent lo-cal schemas without loss of information? If so, hownatural are these translations, and do they requiresignificant restructuring? This question becomes es-pecially important if local schemas are actually pop-ulated with data. A particular work on schema inte-

2

Page 3: Integrating schema integration frameworks, algebraicallyftp.cs.toronto.edu/pub/reports/csrg/583/TR-583-schemaIntegr.pdf · Integrating schema integration frameworks, algebraically∗

gration typically argues that the chosen data model issufficiently universal for “practical purposes”, with-out providing any formal or semi-formal arguments tosupport such a claim, nor estimating the real scopeof universality. The only justification of these choicesfor U is common sense, supported by a series of ex-amples. This approach is perhaps sufficient for well-known models like the relational model, XML, andthe ER model, for which rich intuition and experi-ence have been accumulated. Yet for newer modelslike, say, a new UML-profile, or a new domain-specificdata model (which the Web never hesitates to pro-vide) require more precise and formal arguments.

Anyway, the arsenal of modeling constructs in Umust be rich enough to simulate structuring capabil-ities of the local models, and we return again to theissue of structural and semantic conflicts between theviews. To capture the diversity of conflicts, languagesfor specifying view overlapping become complicatedand hard to use. Moreover, the algorithms for schemamerging are also very complex because they must ad-dress each type of conflict and manage it correspond-ingly.

Finally, most current frameworks for schema inte-gration lack declarative semantic definitions of whatschema matching and merging are. An integra-tion framework usually provides a syntactical algo-rithm for schema merging based on given correspon-dences, and justifies it by some reasonable argu-ments. However, in the absence of a semantically-based declarative definition, one cannot be sure thatone’s merge algorithm is sound and computes exactlywhat one needs. In fact, even semi-formal descrip-tions of the schema merge problem in terms of data-base states/schema instances are uncommon and cer-tainly not widely shared among approaches. An ex-cellent early work [BC86] does address semantics butseems to be a notable exception rather than a rule.1

1Paper [BDK92] provides a syntactic declarative definitionof merged schema as the least upper bound in the lattice ofschemas ordered by their information contents, and [AB01] ex-tends it by working with a graph of schemas and schema map-pings rather than a lattice. However, neither of these workstakes derived schema elements (queries) into account, whichsignificantly lessens their practical applicability.

1.2 The causes

Overlapping between local schemas is a semantic phe-nomenon, and the problem of schema integration isinherently semantic. The crucial observation is thatthe query mechanism plays a major role in recon-ciling conflicts. This is evident for the schemas inFig. 1, and in the paper we will show that views inFig. 2 can also be reconciled with a few simple butgraph-based query operations. However, the lack ofa precise algebraic machinery for specifying queriesto semantically rich models has led to ad hoc ap-proaches. For example, the schema integration ap-proach of [PB03] relies on user provided expressionsfor conflict resolution. In general, modern databasetheory brings a schema integration practitioner tothe following alternative. Either one works with theprecisely defined and well-understood, but semanti-cally inflexible relational model and relational alge-bra. Or one uses graphical and semantically richmodels like ER-diagrams with various enrichments,whose semantics is intuitive rather than formal, andwhose query mechanism is not elaborated. For mostpractical situations, the second alternative is prefer-able, and current schema integration methodologiesresult in complicated solutions lacking precise seman-tic foundations.

1.3 The approach

The goal of this paper is not to present yet anothersemantic data model and a schema integration algo-rithm. Our aim is to define a general specificationframework, where the major ingredients of schemaintegration – matching, merging and (we will see)normalizing the merge – are clearly specified and sup-plied with precise formal semantics. In a sense, weaim to build a clear ontology of schema integrationprocedure, in which different integration steps areclearly separated so that each step is realized with itsown methods and tools. In particular, with a prop-erly addressed match, the very merge is a sufficientlytrivial automatic algebraic procedure.2

2The result is always a well-defined schema but it may turnout to be inconsistent, which means that the local schemas areglobally inconsistent. For poor semantic models, inconsistency

3

Page 4: Integrating schema integration frameworks, algebraicallyftp.cs.toronto.edu/pub/reports/csrg/583/TR-583-schemaIntegr.pdf · Integrating schema integration frameworks, algebraically∗

Technically, the approach is based on the machin-ery of graphs with diagram predicates (dp-graphs) in-troduced in [DK03]. We define a generic notion ofdp-graph, while the signature of diagram predicates isuser-defined. (The approach is reminiscent of the waythat first-order logic provides the mechanism to buildformulas over a given signature of predicate symbols).Some diagram predicates can actually define opera-tions; for example, a ternary predicate add(x,y,z) canbe made equivalent to an operation x=y+z. Thisgives us a framework for query language definitionwithin dp-graphs. The reader may think of a frame-work where the user can define a sort of relationalalgebra suitable for applications.

In a nutshell, our integration strategy is a sequenceof the following steps.

1. Given a set of local schemas, design a predicatesignature rich enough to provide translation ofthe local schemas into dp-graphs. In general,this is a heuristic procedure but if semantics ofthe local schemas is well understood, design andtranslation are not problematic.

2. Match schemas and specify the results by a setof equations between queries against the localschemas. In fact, any correspondence can bespecified in this way if the query language is richenough. To say it briefly, schema matching isquery discovery and equating.

3. Given a set of equations, run the merge algo-rithm, which essentially produces the disjointunion of the local dp-graphs augmented withderived elements, and factorizes it by the equa-tions.

The algorithm also carries the predicates de-clared in the local dp-graphs to the merged dp-graph. Some of these predicates are query speci-fications introduced during the matching phase;

can be automatically detected. For practically interesting se-mantic models, inconsistency can be immediately detected insimple cases but in general should be checked with the cor-responding tools: theorem provers and model checkers. Con-sistency may be algorithmically undecidable [Con86] or unde-cidable due to the expressiveness of the constraints considered[?]. Thus, the result of schema merge should be checked forconsistency.

it means that the merge contains derived ele-ments. Other predicates are constraints declaredin local schemas.

4. Run a model checker/theorem prover to checkand analyze consistency of the merged con-straints.

5. If the merged dp-graph is consistent, normalizeit, that is, remove derived elements (as much aspossible) to reduce redundancy.

1.4 Some highlights

• The language of dp-graphs is provably univer-sal: we show in the paper that any data schemawith formal semantics can be translated to a dp-graph.3 Semantics for dp-graphs is truly com-positional: it can be presented as a graph mor-phism from the schema-graph to the universe-graph. In the latter, the nodes are sets (ofobjects and values), and the arrows are func-tions between them. Arranging semantics asa morphism allows us to relate operations overschemas (the level of syntax) to operations oversets of instances (semantics).

• The results of matching (query equations) arereified as a new correspondence model/schemaequipped with projection mappings into localschemas augmented with necessary derived el-ements. This gives us a universal patternfor recording all the information necessary forschema merging in a compact way.

• We provide a declarative syntactic definition ofwhat is schema merge, and present it as a (graph-based) algebraic operation. We also provide a se-mantic (that is, instance-based) declarative defi-nition of merge. A proof that these two declara-tive constructions define the same construct (upto renaming of elements) is presented in [Dis06].4

• The result of merge contains derived elementsand hence can be restructured by making some

3We also explain the meaning of provability in this context.4The proof is unnecessarily complicated. A short transpar-

ent proof will be presented elsewhere/

4

Page 5: Integrating schema integration frameworks, algebraicallyftp.cs.toronto.edu/pub/reports/csrg/583/TR-583-schemaIntegr.pdf · Integrating schema integration frameworks, algebraically∗

of the derived elements basic and original basicelements derived. We call schemas differentiatedby this choice der-equivalent. We show in thepaper that the correspondence schema as wellas the merge schema are determined up to der-equivalence.

Some parts of the framework outlined above arenovel or may seem novel, but on a whole it is not aradical departure from the existing approaches andideas. For example, the dp-graphs can be seen as a(far reaching) generalization of the familiar, but nowalmost forgotten, functional data model [Shi81]. Asfor integration as such, almost all parts of the pictureare known to the community but in a implicit or fuzzyway. Table 1 makes this observation explicit. Therows present the main actors and the columns referto several well known articles. For each row thereis at least one column with a positive mark (thoughoften with reservations). The framework presentedin the present paper is thus a sort of closure: weformulate the ideas in precise terms, fill-in the gapsand integrate the pieces into a coherent frameworkfor schema integration.

The rest of the paper is structured as follows. Inthe next section we briefly survey the existing ideasand works. Section 3 presents the essence of the in-tegration procedure in our approach; the goal is togive the reader a general notion while details are pre-sented in the consecutive sections.

2 Background.

Existing approaches to schema merging can be clas-sified along the following dimensions.

2.1 Manual vs. semi-automatic(heuristics vs. algebra)

In early approaches like [DH84, Mot87]), schema in-tegration is performed by schema restructuring, ie,by consecutively applying structural operations takenfrom a predefined collection to the local schemas.Which operations to take and how to apply themis the responsibility of the global schema designer.

The entire process is thus essentially heuristic andhuman-centric, though the designer may be aidedby special tools. Another approach, originated in[SPD92, SP92]splits integration into two principallydifferent phases. In the first one, correspondence be-tween local schemas are specified by a set of asser-tions in a special language of correspondence asser-tions. Then the global schema is built automaticallybased on the correspondences specified. In moderngeneric model management parlance [Ber03], bothphases are considered as operations over schemasand are called, respectively, model match and modelmerge. However, there is an important difference be-tween them. Model matching is a heuristic processof discovering correspondences between schemas and,strictly speaking, is not an algebraic operation. Forexample, in the case of two local schemas S1 andS2, the correspondence specification match(S1, S2)is determined not only by the operands S1, S2, butby their semantics and other context dependant fac-tors not specified by the schemas. Thus, an expres-sion C12 = match(S1, S2) refers to a semi-heuristicprocess rather than to an algebraic term. Thisprocess can be assisted with intelligent tools usingideas from machine learning and natural language,but a certain heuristic component is principally in-evitable. The output of this process largely deter-mines the types of discovery tools used. When thecorrespondence assertions are simple (uninterpreted)lines, the correspondence specification is commonlycalled a schema matching [RB01, MBR01, LSDR07],when the assertions are queries or general GLAV(global-and-local-as-view) constraints with a formalsemantics, the specification is typically called aschema mapping [MHH00, PVM+02].

5

Page 6: Integrating schema integration frameworks, algebraicallyftp.cs.toronto.edu/pub/reports/csrg/583/TR-583-schemaIntegr.pdf · Integrating schema integration frameworks, algebraically∗

Mot

87

BD

K92

SP

94

CD

96

AB

01

PB03

FK

MP0

5

Unive

rse

of sc

hem

as

Notio

n of s

chem

a Ve

rsion

of

Func

tiona

l mo

del (F

M)

Rich

er

FM-ve

rsion

Ve

rsion

of E

R-mo

del, E

RC+

Grap

h with

diag

ram

pred

icates

and

oper

ation

s

Abstr

act o

bject

with

str

uctur

e imp

licit

Rich

er

FM ve

rsion

Re

lation

al

The u

niver

se is

seen

as

Set

Lattic

e Se

t Ca

tegor

y Ca

tegor

y La

ttice

N/A

Is du

ality

betw

een s

yntax

an

d sem

antic

s add

ress

ed?

No

No

No

No

Yes b

ut ab

strac

tly:

postu

lated

but n

ot

expla

ined

No

Yes

Sche

ma

mat

ching

:

How

inter

sche

ma co

rresp

. is

spec

ified

Sche

ma

restr

uctur

ing

Name

co

incide

nce

Corre

spon

denc

e as

sertio

ns

Arro

w sp

an

Arro

w sp

an

Arro

w sp

an

TGDs

Does

deriv

ed in

fo pla

y an

impo

rtant

role?

Ye

s Mi

nimall

y and

im

plicit

ly No

Ye

s No

Im

plicit

ly, vi

a ex

pres

sions

on

mapp

ing el

emen

ts

Yes

Decla

rativ

e syn

tactic

def.

No

No

No

Yes

Yes

Yes

Yes

Decla

rativ

e sem

antic

def.

No

No

No

Yes

No

No

Yes

Sche

ma

mer

ging:

De

clara

tive s

yntac

tic de

f. No

Ye

s, alg

.oper

ation

of

lub

No, ju

st an

alg

orith

m Ye

s

Yes,

alg. o

pera

tion o

f co

limit

Yes,

alg.op

erati

on of

lub

Do

es de

rived

info

play a

n im

porta

nt ro

le?

Yes

Minim

ally a

nd im

plicit

. (vi

a clos

ure c

ond.)

No

Ye

s No

Im

plicit

ly, vi

a ex

pres

sions

Is

merg

e dete

rmine

d up t

o de

r-equ

ivalen

ce?

No

No

No

Yes

No

Some

what

(par

tially

, impli

citly)

Decla

rativ

e sem

antic

def.

No

No

No

No

No

No

N/A

Tab

le1:

Surv

eyta

ble

6

Page 7: Integrating schema integration frameworks, algebraicallyftp.cs.toronto.edu/pub/reports/csrg/583/TR-583-schemaIntegr.pdf · Integrating schema integration frameworks, algebraically∗

In contrast to the semi-heuristic match operator,when correspondences between schemas are given bysome specification C12, the global schema G can po-tentially be computed automatically. That is, it isuniquely determined and can be algebraically writ-ten as

(1) G = merge(S1, S2, C12),

where merge refers to a ternary operation S × S ×C → S with S denoting the universe of schemas andC the universe of correspondence specifications.

An important question of the automatic approachis what are the objects of the universe C? In earlywork on automatic integration [SPD92] they arestatements in a special language of correspondenceassertions tailored for a particular data model ERC+,an enriched version of ER-diagrams. More recently,GLAV schema mappings, most commonly specifiedas formulas of first-order logic adapted for the re-lational data model are used [Len02, FKMP05]. Adisadvantage of these specifications for schema merg-ing is that the third operand C12 is conceptually andstructurally different from the first two operands and,thus, merge needs to handle two very different arti-facts in a coherent way.

A different and prominent idea proposed in [CD96,AB01, PB03] is to reify the correspondence specifica-tion as a schema, say, S0, equipped with two func-tions (projections) f0i : S0 → Si, i = 1, 2 into localschemas. In mathematical category theory, a set offunctions/arrows with a common source is called aspan and thus a correspondence specification is aspan

C12 =(S1

�f01S0

f02- S2

).

Schema S0 is called the head of the spans and func-tions f0i, i = 1, 2 are arms. In modern meta-data management literature, schemas are often calledmodels and the entire span configuration as above amodel mapping. Also, the projection functions areoften left implicit.

2.2 How to specify and manage struc-tural conflicts

Whatever language/format is used for specifying cor-respondences, the main question is whether it isexpressive enough to capture overlapping of localschema semantics in precise terms. Moreover, if wespeak about an automatic phase of schema merge, thelanguage must be formal. The issue is non-trivial be-cause of structural (sometimes called semantic) con-flicts between local views.

Obviously, for automatic schema merge we needa precise taxonomy and specification of possible con-flicts. In [SP92], a taxonomy distinguishing semantic,descriptional, structural and heterogeneity conflictsis presented. However, their description is largelyinformal and is tailored for a particular data modelERC+. Paper [PB03] elaborates the issue further andpresents a more precise taxonomy in terms of a muchmore universal tree-based schema format. They dis-tinguish representation, meta-model and fundamen-tal conflicts. The former address the possibility ofdifferent representations of the same data (structuralconflict) while the latter address the integrity of themerged schema after absorption of the informationfrom the local schemas. The meta-model conflicts areconflicts of translations between particular schemalanguages and the universal language used in [PB03].Very simple conflicts are well managed in [PB03] bythe very span format for correspondence; for morecomplex conflicts, a mechanism of algebraic expres-sions attached to mapping elements is used. Unfor-tunately, the details of the expression language arenot specified, and it decreases the precision of themachinery making it partially heuristic.

2.3 Semantic justification of syntacticprocedures

In the present paper, by semantics we mean consid-ering database states or else schema instances in aexplicit and precise way. That is, if S is a schema,we consider its set of instances inst(S), functions de-fined on them and operations over them. Note thatthe term “semantics” is often used for talking aboutpurely syntactical requirements intended to address

7

Page 8: Integrating schema integration frameworks, algebraicallyftp.cs.toronto.edu/pub/reports/csrg/583/TR-583-schemaIntegr.pdf · Integrating schema integration frameworks, algebraically∗

semantic issues. In this case, semantics is implicitand (at least because of this) is informal.

The consideration of semantics in a precise way isimportant for schema integration. Consider the fol-lowing questions. Is a given pool of restructuringoperations, or a given correspondence assertion lan-guage expressive enough for specifying all possibleconflicts between local schemas, or all possibilities oftheir overlapping? Can a given data model serve asa common universal model for the local models, or towhat extent is it universal, or how natural are trans-lations from the local models into this common datamodel? These questions are essentially semantic andcan be answered only in a formal semantics frame-work. However, formal semantics is rarely addressedin schema integration approaches (with a few notableexceptions [BC86]), particularly those based on infor-mal, but expressive, semantic data models.

The majority of schema integration approaches in-clude clear syntactical procedures that are motivatedby reasonable, but informal, semantic considerations.Thus, semantics in the sense above is not addressedand the question of whether the result of a syntacti-cally defined procedure is semantically sound, that is,whether we compute what we really need to compute,remains more a subject of belief rather than formalverification. In many approaches there are no declar-ative definitions of what is the merged schema at all;hence the merge algorithm is simultaneously a defi-nition of what is computed. A declarative (but stillsyntactical) definition of what is schema merging ap-peared first in [BDK92]. They organize the universeof schemas into a set partially ordered by inclusionsbetween schemas, and show that this poset is a lubsemi-lattice (that is, the least upper bound of twoschemas always exists). Paper [PB03] significantlydeveloped this idea by applying it for a much richertype of schemas and for a much richer set of possi-ble correspondences between them. Unfortunately,their framework becomes less precise as soon as cor-respondence specifications between schemas involveexpressions attached to mapping elements.

On the other hand, research on the data exchangeproblem is properly semantically based from the verybeginning [FKMP05]. They first define the notion ofmapping between schemas and its derivatives seman-

tically and then consider necessary syntactical meansto specify these notions. There are two obstacles indirect use of these results for schema integration: (i)the context of data exchange is different and (ii) theformalisms is tightly connected to the relational datamodel.

2.4 Summary of Schema IntegrationApproaches

In Table 1, we present a summary of our observationsabout the state-of-the-art in schema integration. Weinclude six papers on schema integration methods,along with a paper on data exchange. Although thelast paper does not include a schema merging opera-tion, it is notable for including a declarative seman-tics for schema matching. It is also relevant to theapproach we will be presenting because it explicitlyrecognizes that structural (semantic) conflicts can beresolved by explicitly representing derived informa-tion as queries.5 This same observation was mademuch earlier by Motro [Mot87] and [CD96], but omit-ted from later integration work.

As we have noted, all of these approaches have lim-itations in some important aspects. Our goal in thispaper is to present a schema integration approachwithout these limitations and in a conceptually clearway. In the next section, we present an example ofschema integration of two relational schemas. We usethe example to show the main phases of our approachand the main problems to be solved.

3 Sample scenario

3.1 Example of relational schema in-tegration

Consider two simple relational schemas in the toppart of Fig. 3. The schema S1 consists of two tables,Orders and Hardware, each having three columnswhose names are written under the table names. Thesecond schema consists of one three-column table

5Indeed an early paper on discovering schema mappings fordata exchange was called “Schema Mapping as Query Discov-ery” [MHH00].

8

Page 9: Integrating schema integration frameworks, algebraicallyftp.cs.toronto.edu/pub/reports/csrg/583/TR-583-schemaIntegr.pdf · Integrating schema integration frameworks, algebraically∗

(i) Heuristics: Specifying correspondence by query discovery.

(ii) Algebra: Merge operation (iii) A mixture: Normalization

Abstract description of the example above

eΣeN [ ≅ ]

SN

SΣ derqΣ SΣ

derqN SN

f2f1S1

derq1S1 derq2S2

S2S0

v1 v2

[merge]

S1

derq1S derq2S

S2 S0

mapping v2

schema S1+= derq1S1 Orders Hardware T=πc,cN (σ vend=”HP”(Ord ⊗Hdw))

customer# product# col1=customer-proj custName prodName col2=custName-proj product# vendor

schema S0 HP-customer

customer# name

schema S2 Customer

customer# name bDate

schema SΣ Orders Hardware HP-Customer

customer# product# customer# = col1 of T name prodName name = col2 of T product# vendor bDate

schema S2 Customer

customer# name bDate

schema S1 Orders Hardware

customer# product# custName prodName product# vendor

Two relational schemas to be integrated

Specifying correspondence via span (S0,f1,f2) and merging

SΣ = S1 ⊕(S0, f1, f2) S2

Removing derived elements (Normalization)

mapping f1 mapping f2 query q1

schema SN Orders Hardware HP-Customer

customer# product# customer# name prodName bDate product# vendor

mapping v1

Constraint C: πc (σ vend=”HP”(Ord ⊗Hdw) = πc(HP-Cust)

New elements are dashed-dotted and green

New elements are dashed-dotted and green New elements are

dashed and blue

Figure 3: Example of relational schema integration. [Derived elements are shaded (and blue in color print). In long expressions, names of tables and columns are abbreviated by a few letters.]

9

Page 10: Integrating schema integration frameworks, algebraicallyftp.cs.toronto.edu/pub/reports/csrg/583/TR-583-schemaIntegr.pdf · Integrating schema integration frameworks, algebraically∗

Customer. Suppose we know that the set of cus-tomers referred to by schema S2 is exactly the setof those customers for schema S1, who ordered prod-ucts of vendor “HP”. This latter set can be computedby an evident query q1 specified in schema derq1

S1 inFig. 3.

Our first step is to specify the correspondence be-tween schemas in a formal and sufficiently generalway. To this end, we augment schema S1 with aderived table T[col1,col2]: see Fig. 3 where the de-rived elements are shaded. Table T is computed byquery q1 as specified in the Figure. We denote theaugmented schema by derq1

S1 or S1+. Then we in-

troduce a new schema S0 with the only table HP-Customer and show how this table is represented inthe local schemas. This work is done by schema map-pings f0j : S0 → Sj

+ = derqjSj (j = 1, 2), which map

elements of the middle schema to the correspondingelements, either basic or derived, of the local schemas.In our particular example, S2

+ = S2. Thus, the cor-respondence information is specified by the followingconfiguration of schemas and mappings between them

C12 =(derq1

S1�f01 S0

f02- derq2S2

).

We will call such a configuration a span.The span C12 can be thought of as a set of equa-

tions between local schema elements like col1@S1+

= customer#@S2 and col2@S1+ = name@S2, where

expression e@S means that e ∈ S. These equationsare necessary for the merge algorithm to eliminateduplication of elements in the merged schema.

The correspondence span we have built has a clearsemantic meaning. Note that mapping f01 defines aview to schema S1 (with schema S0 being the viewschema). Semantically, it means that for any in-stance I ∈ inst(S1), the result of view/query exe-cution is defined as [[ f01 ]](I) ∈ inst(S0). Here weuse the following notation: if X is a syntactic notion,a schema or mapping between schemas, then [[X ]] isits semantic meaning, the set of instances or the map-ping between these sets respectively. The same is formapping f02, in our case trivial. Thus, the seman-tic meaning of the span declaration is the followingequality: [[ f01 ]](I1) = [[ f02 ]](I2), where Ij is an in-

stance of schema Sj for j = 1, 2. That is, S0 specifiesthe shared information in S1 and S2.

The natural next step is to merge the augmentedlocal schemas disjointedly, and then glue together theelements that are declared equal in the correspon-dence span.The resulting schema SΣ will be a disjointunion of the three components:

(2) SΣ∼= (S1

+ \ S0) ] S0 ] (S2+ \ S0).6

In addition, we obtain mappings (views) between thelocal schemas and the merge, v1 and v2 in the figure.Thus, the result of integration is a configuration ofschemas and mappings S = (SΣ, v1, v2) specified inFig. 3. We will call such configurations sinks.

If schemas in the correspondence span contain de-rived elements, the merge schema will also containderived elements. To finish integration, we may liketo remove from SΣ derived (shaded) items. In generalthis process is not trivial since removal derived itemsmay violate some structural requirements to schemas.For instance, in our example, removing the derivedcolumn cutomer# will leave the table HP-Customerwithout its primary key. Hence, we forced to keepit in the schema but then add to it a correspondingintegrity constraint: this is Constraint C in Fig. 3,which asserts equality of the results of two queries.

Thus, an important final part of the integra-tion procedure should be a special procedure ofschema normalization. The latter is aimed at find-ing some schema SN with minimal redundancy butder-equivalent to the merge schema. That is, eachelement of the merge schema SΣ can be derived fromSN (no information is lost) and conversely, each el-ement of SN is derivable from SΣ (nothing extra isacquired). It is easy to see that our schema SΣ can befurther normalized but a detailed discussion goes be-yond the goals of this paper but has been addressedby others [AH88].

6This would be a quite precise description if schemas weresets of elements. However, relational, XML and other schemasused in practice are much richer structures, and the mergeprocedure must be compatible with it. Later we will addressthe issue.

10

Page 11: Integrating schema integration frameworks, algebraicallyftp.cs.toronto.edu/pub/reports/csrg/583/TR-583-schemaIntegr.pdf · Integrating schema integration frameworks, algebraically∗

3.2 Abstract arrangement

The diagrams in the bottom part of Fig. 3 presentan abstract view of the procedures we used in theexample. This view is based on abstract graphs andoperations over them. Nodes of these graphs denoteschemas (considered as structured sets of elements)and arrows are mappings between them (preservingthe structure); hooked arrows denote inclusions. Indiagram (ii), label join denotes the graph-based ana-log of the familiar lattice-theoretic operation of tak-ing the least upper bound (cf. [BDK92]). This analogis defined as follows. Mappings vi : Si → SΣ, i = 1, 2say that all the information contained in the localschemas is contained in the merged schema as wellwithout duplication. The latter condition is pro-vided by the commutativity of the diamond formedby the arrows: for any element e ∈ S0, equalitye.f01.v1 = e.f02.v2 holds in SΣ. Thus, information oflocal schemas is properly passed to SΣ and no infor-mation is lost (SΣ is an upper bound). To formulatethat nothing extra is acquired (SΣ is the least up-per bound), we take an arbitrary schema S′ togetherwith mappings v′

i : Si → S′, i = 1, 2 such that thediamond (f01, v

′1, f02, v2) is commutative. That is,

schema S′ also properly (without duplication) con-tains the information of the local schemas (anotherupper bound). The minimality property of SΣ thenmeans that for any such S′ there is a uniquely de-fined mapping ! : SΣ → S′ such that every diagramformed by all the arrows involved is commutative.In mathematical category theory such an operationon objects is called colimit. Thus, it is reasonableto define the result of merging schemas matches bya span as the colimit of the span. We will use themore familiar term (arrow) merge as a synonym forcolimit. Note that the result of [merge] is a config-uration of schemas and mappings, a cospan or sinkS = (SΣ, v1, v2), rather than the merged schema it-self.

A well-known result of mathematical category the-ory says that for a wide class of meta-models, theclasses of schemas defined by those metamodels areclosed under arrow merge. Moreover, for this classof metamodels, merge defined above declaratively,can be defined constructively as well. Roughly, a

merge is computed by taking disjointed union of thetwo schemas followed by factorization according tothe “equations” specified by the correspondence spanC12 = (S0, f01, f02). A typical algorithm of this sortfor the case of graphs is presented in [SE05]. It canbe immediately generalized for any schema consist-ing of sets of elements and mappings between them[BW95]. In this way we obtain an effective procedurefor computing merged schemas defined w.r.t. their in-formation content. Finally, it can be proved that thedeclarative syntactical definition above is equivalentto a reasonable semantic definition of merge in termsof instances [Dis06].

3.3 Discussion

We have seen that schema integration consists ofthree consecutive steps: matching, merging, and nor-malizing the merge. The first and the third arecontext-dependant heuristic procedures (that may beassisted by tools but not fully automated), the sec-ond step is an algebraic operation and can be en-tirely automatic. The key to a proper integration isnot in the merge itself but in a proper specification oftheir overlapping/correspondences often called struc-tural conflicts. Below we call them conflicts.

A major integration issue is thus how to classifyand specify conflicts between schemas: a few at-tempts were made in the literature, most notable andsystematic are [SPD92, PB03]). These classificationsare quite complicated and lead to considerable com-plexity of the merge algorithms. A major contribu-tion of the present paper is to show that all conflictsbetween views can be uniformly described as in ourexample above: a schema element basic in one viewcan be derived in another view (and vice versa). Ingeneral, a conflict appears as the “sameness” of twoelements, e1 ' e2, where ei denotes the result of ap-plying a query qi to schema Si, i = 1, 2. Then themerge algorithm itself is a purely algebraic procedureand is simple.

The first step towards drawing this base is toavoid syntactical idiosyncracies of relational or otherparticular data models and consider the integrationproblem on the basis of what we call a universal datamodel; the query language is also included as part of

11

Page 12: Integrating schema integration frameworks, algebraicallyftp.cs.toronto.edu/pub/reports/csrg/583/TR-583-schemaIntegr.pdf · Integrating schema integration frameworks, algebraically∗

the model. A universal model must (i) be tailoredfor specifying semantics in as direct a way as possi-ble, (ii) be semantically rich to allow specification ofa wide class of conflicts, (iii) allow for natural transla-tion from practically used data models like relational,XML or ER and yet, (iv) be syntactically manageableand semantically suggestive. We will present a frame-work, in which it is possible to build such model, dp-graphs, in the next section. After that, we will showhow various conflicts identified in the literature canbe specified by query equations over dp-graphs.

4 Universal data model

Dp-graphs were introduced (under the name ofsketch) and applied to data modeling in [DK03]. Toexplain the idea, we begin with an example. It is sim-ple but sufficiently generic to present the language ofdp-graphs and its expressive capabilities.

4.1 Example of data definition withdp-graphs

Fig. 4(a) presents a simple ER-diagram. Semanticsof its elements is clear from their names but we wantto make it precise. Let t be a time moment, or thedatabase/real-world state at moment t. The rec-tangle nodes give us four sets [[ Book ]]t, [[Author ]]t,[[ Shelf ]]t and [[Building ]]t. Attributes are functionstitle : [[Book ]]t → [[ str ]], [[ tel ]]t : [[Author ]]t → [[ int ]]etc, where [[ str ]], [[ int ]] are extensions of primitivetypes, which do not depend on time. Note that do-mains of the attribute are implicit in ER-diagramsand we used semantics of names to reveal them. Di-amond nodes denote relations: [[RBA ]]t ⊂ [[ Book ]]t×[[ Author ]]t etc. Relations [[RBS ]]t and [[RSB ]]t are oftype many-to-one and hence are functions at any mo-ment. Furthermore we will omit the index t keepingit in mind. We will also omit the square brackets ifit does not lead to confusion.

Formal explication of what are keys, relative keysand weak entity sets need slightly more accurate for-mulations, and we proceed to Fig. 4(b). In this dia-gram, nodes denote sets and arrows are functions, ei-ther single-valued (an ordinary head) or multi-valued

(double-head). Rectangles denote sets with extensionchanging in time (classes), and ovals denote primitivetypes with permanent predefined extension. A fam-ily of functions with a common source fi : X → Yi,i = 1..n is called a key if for any two different ele-ments of the source, a, b ∈ X, there is at least onefunction fi in the family such that fi(a) 6= fi(b).Then in the diagram, which is a syntactical object,label [key] “hung” on the pair of arrows name andbdate means that at any time moment t, the pairof functions [[ name ]]t and [[ bdate ]]t is a key. Simi-larly, declaring the pair of arrows author and bookto be a key to the node Authorship actually meansthat extension of this node is a set of pairs (a, b) witha ∈ [[ Author ]]t and b ∈ [[Book ]]t. If a single arrow isa key, we mark it with a short bar crossing the tail.

Note that node Book in the ER-diagram is declaredto be a weak entity (double-frame). Suppose that itmeans that each book participate in at least one Au-thorship and hence depends on it (or the correspond-ing author). Formally, it means that the function[[ book ]]t is surjective (or covering) at any time mo-ment; hence, the label [cover] on arrow book.

Since relationships RBS and RSB are of type many-to-one, they are in fact functions and we replace themby arrows on : Book → Shelf and in : Shelf → Building.Note the label [key] hung on the pair (num, in) –this the precise formal meaning of declaring the nodeShelf to be a weak entity in the sense that its keynum is relative to the Building the shelf is in. Notethat this semantics of weak entity is quite differentfrom semantics of node Book, and the difference ismade explicit in our dp-graph. In the literature onER-diagrams, there is one more meaning of being aweak entity, which is usually called existence depen-dency. Namely, in our context, if a Building is deletedfrom the universe (the database), all its Shelves arealso deleted. This is a dynamic property of function[[ has ]]t, which we make explicit by declaring pred-icate [contains] for arrow has.7 Formal definition

7In addition, as a syntactic sugar tip, we also make thetail of the arrows a black diamond: this notation is borrowedfrom UML and is now common. Note that usually semanticsof black diamond includes not only existence dependency (ourpredicate [contains]) but also non-sharing: every shelf belongsto one and only one building. In our dp-graph this is captured

12

Page 13: Integrating schema integration frameworks, algebraicallyftp.cs.toronto.edu/pub/reports/csrg/583/TR-583-schemaIntegr.pdf · Integrating schema integration frameworks, algebraically∗

(a) an ER-diagram with weak entity sets

(b) dp-graph specifying intended semantics of diagram (a) (names of diagram predicate are bracketed by [-] and red)

(c) a few simple queries (names of diagram operations are bracketed by

[-⟩ and blue; derived elements are dashed and blue)

/bo-au--name

book

//age

/age /bo-au

name bdate

yy Book

isbn

str

Person

Authorship

Date str

[navigate⟩

int

⟨subtr]

* < 30 /Young

⟨select]

[compose⟩

dd

1 num Building Shelf

RBA Book

title isbn RBS

m

m n Author

RSB

m

name mm yy

bdate

tel

addr 1

The Legend: - keys are underlined; - dashed underline denotes a relative key

name

bdate

dd yy book author on

num

Book

title str

isbn str

Shelf str

Person

Building

Authorship

in

[cover] [key]

[key]

[inverse]

Date

int [key]

Addr addr

[contains] str

has

Figure 4: Dp-graphs vs. ER-diagrams

of the property [contains] is more intricate thanabove because it involves specifying behavior of func-tion [[ has ]]t when t changes; details can be found in[DK03].

Finally, to express the required semantics of theconfiguration Shelf - RSB - Building, we have intro-duced two functions, in and has, representing thesame relationship RSB . Hence, these functions aremutually inverse and we must declare this fact in outdp-graph: note label [inverse] hung on the respectivepair of arrows. The resulting dp-graph in Fig. 4(b)has the following two main features: it does possessa precise formal semantics (which is close to the in-tended semantics of the ER-diagram), yet it is syn-tactically transparent and similar to the ER-diagram.

by declaring the arrow in (inverse to has) to be single-valued.

4.2 Querying dp-graphs

A special subclass of diagram predicates are diagramoperations. For example, we may consider an oper-ation navigate: its input is a pair of functions witha common source (multi-relation), and the output isthe function of navigating the relation from its oneparticipant to the other. Syntactically, declaring suchoperation looks like shown in Fig. 4(c). Semanticallyit means that having a relation

([[Authorship ]]t,[[ book ]]t,[[ author ]]t)with two projection functions to the classes Book andAuthor, we navigate the relation and produce a new(in general, multi-valued) function

/bo-au : [[Book ]]t → [[Author ]]t.More accurately, /bo-au/ is the name of the new func-tion denoted in the dp-graph by a new arrow /bo-au,and the very new function is its extension [[ /bo-au ]]t.We denote derived elements by dashed blue lines andtheir names are prefixed with slash (the latter nota-tional tip is borrowed from UML).

Having a pair of functions with the target of thefirst being the source of the second, e.g., [[ /bo-au ]]t

and [[ name ]]t in Fig. 4, we can compose them andproduce a new function. Syntactically we denote thisnew function by an arrow /bo-au-name. Semantically,its extension [[ /bo-au-name ]]t is the result of execut-ing the operation compose. The right lower part ofthe diagram shows a query specification, whose exe-cution would produce a set of Persons younger than30.

Thus, queries against dp-graphs are diagram oper-ations: they input and output configurations of setsand mappings (specific and predefined for a specificoperations). Technical details of how syntax and se-mantics of such operations is formally defined can befound in [Dis96]. We emphasize that we do not pro-vide an a priori fixed query language. It is the user’sresponsibility to define her own query language in ac-cordance with dp-graphs’ syntax, and supply it withexecution procedures.

4.3 Universality statement.

Thesis. Let S be some data schema (relational,XML, ERD, yours favorite one). If semantics of S

13

Page 14: Integrating schema integration frameworks, algebraicallyftp.cs.toronto.edu/pub/reports/csrg/583/TR-583-schemaIntegr.pdf · Integrating schema integration frameworks, algebraically∗

can be somehow formalized, then there is a dp-graph(graph with diagram predicates) G such that setsinst(()S) and inst(()G) are naturally isomorphic.

“Proof”/Justification. Let X be some notion,eg, semantics of S. Formalizability means that Xcan be expressed in some formal set theory devel-oped in modern mathemtics. There are a few suchtheories, which are roughly equivalent between them-selves. One of the most handy and convenient inapplications is the higher-order type theory, HOTT,presented, e.g., in [LS86]. Thus, we may define for-malizability of X as the possibility to present X as aHOTT-theory.

In part II of [LS86], the following results are care-fully proven. Any HOTT-theory generates a spe-cial algebraic structure called topos. Conversely, anytopos can be presented as a HOTT-theory. Moreover,the notions of HOTT-theory and topos are equivalentin some technical sense. Thus, we may say that for-malizability of X means the possibility to present Xalgebraically as a topos.Finally, any topos is noth-ing but a dp-graph in a specific signature of diagrampredicates (in fact, operations), see e.g. [FS90] for aprecise but “pictorial” definition. We conclude thatformalizability of X means that X can be encodedby a dp-graph.8

4.4 FAQ about dp-graphs

How natural are translations of local schemasin different data models to dp-graphs? If se-mantics of a local schema is well understood in pre-cise (better, formal) terms, translation to a dp-graphis easy because dp-graphs are nothing but specifica-tions of schema semantics.

How complex is the vocabulary of basic con-cepts for dp-graphs? Any dp-graph consists of ele-ments of only three types: nodes, arrows and diagrampredicates. In addition, some of the elements can bedeclared derived, which means that the correspond-ing diagram predicate is actually an operation.

8Speaking more accurately, we have mentioned three formalmodels, ie, three formal definitions, of the notion of formaliz-ability. What is mathematically proven is that they are allequivalent.

(a) building the graph of a function

(b) navigating a (multi)-relation

gR

X

Y

fR R ⟨navig][inv]

qf

[graph⟩

X

Y

f /Rf

pf

Figure 5: The representation conflict of dp-graphs

How complex are possible structural conflictbetween different dp-graphs representing thesame data? We will show in the next section howdifferent representation conflicts can be managed byqueries. Since the vocabulary of structural elementin dp-graphs consists of only nodes and arrows, thereis only one type of structural conflicts. To wit: dataseen as a set (node) by one dp-graph, is seen as afunction (arrow) in another dp-graph. This conflict isresolved by two queries/operations as shown in Fig. 5.

5 Schema matching as querydiscovery and equating

In this section we show that the diversity of conflictsbetween schemas can be reduced to basically the onlyone: a query against one schema and a query to an-other schema produce the same result (up to canonicisomorphism). We will call such statement a queryequation. The idea appeared in [CD96] and then in[MBHR05] but it seems its full potential is still notrealized. The most elaborated taxonomies of con-flicts reported in the literature are those in [SP92]and [PB03]. Since the latter claims that its taxon-omy and merge algorithm subsumes those in [SP92],we will carefully consider how conflicts considered in[PB03] can be described by query equations. To easereferences, we will call the taxonomy and merge pro-cedure specified in [PB03] Vanilla – the name of thetool implementing them. Also, in this section we useterms schema and graph interchangeably.

14

Page 15: Integrating schema integration frameworks, algebraicallyftp.cs.toronto.edu/pub/reports/csrg/583/TR-583-schemaIntegr.pdf · Integrating schema integration frameworks, algebraically∗

5.1 Vanilla’s constructs via dp-graphs

5.1.1 Preliminaries.

In this section we apply general patterns for specify-ing inter-schema correspondences (section 3.2) in thedp-graph framework. For dp-graphs, the importantcondition of preserving the structure by inter-schemamappings (projections of correspondence spans andinjections of local schemas into the merged ones) isformulated as follows.

First of all, mappings must be compatible with in-cidence relations between nodes and arrows. Giventwo graphs G, H and a mapping f : G → H betweenthem, we require that if f(a) = b for arrows a ∈ Gand b ∈ H, then

(3) f(2a) = 2b and f(a2) = b2,

where 2∗ and ∗2 denote the source and the targetnodes of an arrow ∗. We will often call mappings be-tween graphs satisfying this requirement graph mor-phisms. To specify a graph morphism, it is sufficientto specify it for arrows and isolated nodes. For ex-ample, the only dotted curly arrow in the right halfof Table 2(a’) specifies morphism f02 as precisely asthree dotted arrows specify f01 on the left. The sec-ond condition requires preservation of diagram predi-cates. Given two dp-graphs G, H, a graph morphismf : G → H is called a dp-graph morphism, if for anygroup E of nodes and arrows in G labeled by pred-icate P , the image of this group in graph H, f(E)is also labeled by P . Below we will call morphismmappings as it is a suggestive and natural name forthem. We remind that the configuration

C12 =(S1

�f01S0

f02- S2

).

is called a correspondence span with schema S0 thehead and projection mappings fi the legs.9

The left column of Table 2 presents simple exam-ples (taken from [PB03]) illustrating Vanilla’s tax-onomy of conflicts. In the right column we spec-ify these examples with dp-graphs and mappings be-tween them. We will consider the table row by row.

9Warning: in the literature, particularly in Vanilla, spansare often called mappings while the projections are left name-less.

5.1.2 Synonymy

Conflict (a). It is a simple synonymy situation.Both specifications (a) and (a’) are practically equiva-lent yet two points of difference are worth mentioning.(i) For us, attributes are arrows targeted at primitivetypes and hence attribute domains become explicit.Since schema integration is all about semantics, ex-plicit specifications are preferable. (ii) An equationbetween elements is set by projection mappings andlabeling the corresponding element in schema S0 withequality symbol is not necessary. The head of thecorrespondence span is an ordinary schema like S1

or S2, in which there are neither equality nor similar-ity labels.Particularly, it allows us to write names ofcorrespondence nodes right on them, which is conve-nient and helpful in practical work.

Conflict (b). Here the situation is a bit morecomplicated. Attributes f-name, l-nameof the rightschema (S2) are absent in the left schema (S1) butcan be derived in it. This derivation can be speci-fied either in the left schema itself, or in the corre-spondence schema. We choose the latter way as itmakes comparison with Vanilla easier. Compositionof two evident queries q11 and q12 against the corre-spondence schema produces two new arrows /f-nameand /l-name/ as needed. Now correspondence be-tween the local schemas can be specified by settingprojection morphisms as shown.

Remark. All examples in Table 2 are very specialin the sense that the local schemas are subsumed bythe correspondence schema and hence the latter issimultaneously the merged schema as well. In moredetail, what the merge algorithm described in thenext section does is “absorbing” (without any dam-age to the structure!) those parts of local schemas,which are in the range of projection mappings, intothe correspondence schema . The name conflictsare uniformly resolved: the correspondence schemanames dominate. Since for the examples in the table,projections are surjections (cover their target), localschemas will be entirely absorbed. This considerationhelps to interpret the examples.

The result of merge in cell (b’) is thus a schemawhere the local schemas are embedded. In additionto images of the local schemas, the correspondence

15

Page 16: Integrating schema integration frameworks, algebraicallyftp.cs.toronto.edu/pub/reports/csrg/583/TR-583-schemaIntegr.pdf · Integrating schema integration frameworks, algebraically∗

schema include also new information: specificationof query q1 = q11; q12, which relates data instances ofthe local schemas.

Conflict (c). Semantically, the situation is en-tirely similar to the above. Instances of the leftschema can be derived from those of the right oneby executing a query q2. We again have partiallydefined projection mappings, which will absorb thelocal schemas into the correspondence one. The onlydifference between (b’) and (c’) is that the formeruses query q1 to derive the right from the left, while(c’) uses query q2 to derive the left from the right.Moreover, it is easy to see that queries q1 and q2 aremutually inverse, hence the correspondence schemasin (b’) and (c’) are der-equivalent : they differ onlyin the choice of which elements are considered basicand which are derived.

A striking difference in Vanilla’s view of the twosituations is caused by replacing querying in (b’)by a new structural relationship, that is, as a newrather than derived information.10 Not only replac-ing queries by sub-structural relationships unjustlycomplicates the taxonomy and hence the merge al-gorithm, it may create problems later if the originaldata model does not have a similar structural rela-tionship. For example, SQL does not have a con-struct of sub-column. Vanilla calls such problemsmeta-model conflicts. Thus, Vanilla’s way of process-ing a typical representation conflict may create an-other type of conflict.

5.1.3 Homonymy

Non-mapping element (d). Semantics of the sit-uation is this. Schema matching investigation revealsthat left and right Persons refer to the same class butthe attributes bio, although having the same namein both schemas, actually refer to different concepts:official bio (o-bio) for the left and unofficial one (n-bio) for the right view. Furthermore, it is revealed

10As for similarity elements with expressions attached, cell(c), this is Vanilla’s way to manage query specifications. Btw,machinery of expressions is syntactically hard to manage. Itis unclear, for example, how to compose mappings carryingexpressions or reverse such mappings – this problem was ex-plicitly stated in [Ber03]. It can be resolved with dp-graphs asshown in [Dis05].

that a person actually has two bios but the left viewis not interested (does not know about?) n-bio whilethe right view does not concern about o-bio. (Thisis semantics of (d) as it is described in [PB03]). Thecorrespondence span in cell (d1’) directly specifies ex-actly this semantics: Persons are glued together whilebios are not. The fact of having exactly two bios canbe captured by requiring the functions [[ o-bio ]] and[[ n-bio ]] to be totally defined. Note that each elementof the correspondence schema is either in the domainof projection f01 or f02, that is, there are no non-mapping elements in the span. The following con-sideration explains what is meant by Vanilla’s classi-fication of case (d) as non-mapping element. Let usperform merge/join of the span in (d1’). The result isshown in (d2’) with black solid lines. Now we can ap-ply the operation of pairing to the two functions, andobtain a new derived function allBio. For any personP , allBio(P ) def= (o-bio(P ), n-bio(P )) ∈ txt× txt, thatis, a pair of texts with first component giving the of-ficial, and the second the non-official bio. It appearsthat Vanilla implicitly introduced into correspon-dence specification a part of the future merge! Tofinish the case, diagram (d3’) shows a der-equivalentrestructuring of the merge, where the attribute allBiois basic while the component bios are derived. (Der-equivalence is implied by the fact that operations ofpairing and projecting are mutually inverse).

Conflict (e). We again have a case of homonymy.Zip elements of the views are declared equal but theirtypes are different. Hence, after the merge, elementZip will have two types, which violates the integrityconstraint of the model itself: types must be unique.Vanilla calls such conflicts fundamental and pays spe-cial attention to them. The way of resolving this typeof fundamental conflicts in Vanilla is borrowed from[BDK92] and is shown in the lower half of the dia-gram (e). The intersection of the types is formed andis considered to be the new unique type. There are afew reasons to consider this solution unsatisfactory.First, it is not clear what is to do with Zip-valuesoutside the intersection. If there are no such values,then the views actually use the same type int-and-str, which can be matched from the very beginning.If such values exist, then the conflict is not resolved.

16

Page 17: Integrating schema integration frameworks, algebraicallyftp.cs.toronto.edu/pub/reports/csrg/583/TR-583-schemaIntegr.pdf · Integrating schema integration frameworks, algebraically∗

Finally, there are possible other operations on typescoercing/unifying them and the choice of intersec-tion only is not justified. For example, diagram (e1’)presents another view of the conflict in the dp-graphframework.

First of all we note that when attributes are seenas arrows, it is illegal to equate two zips with differenttarget nodes: we remind that equality of two arrowsautomatically means equality of their sources and tar-gets.11 Thus zip-arrows cannot be equated. Yet wecan equate their sources thus coming to situation in(d1’). It would mean that an address has two zips,one of type int and the other of type str. That wouldbe an interesting address system but, anyway, if se-mantics of the case is such, we specify it and merge asshown in cells (d1’)..(d3’). However, for address sys-tems, more likely is the situation when different ziptypes mean that we deal with different address sys-tems, e.g., in US and Canada. The correspondencespan specifying this semantics is shown in cell (e1’).The only fact this correspondence spec reveals is thatAddr and zip are homonyms and fixes the problem byrenaming the Addr nodes (assuming that the arrownames are qualified by the source names, us-zip, ca-zip). Merge is fairly simple as shown in cell (e2’), ig-nore the blue-dashed part of the graph for a moment.Since two address systems have much in common, itis reasonable to generalize the classes as shown in(e2’) (the upper operation [gen], double-body arrowsdenote subclassing). In the simple set-theoretical se-mantics we consider, the operation is nothing but thedisjoint set union).12. Similarly, we can generalizethe domains into a new domain int ∪ str (in pro-gramming terms, form a variant type). Finally, wedefine a new generalized attribute /zipin the evidentway: it is the int-valued zip (left view) for USAddrobjects in /Addr and the str-valued zip (right view)for CaAddr-objects. Da-graph (e3’) presents a der-equivalent restructuring of the merge, in which gen-eralized Address and zip are basic while their US andCa versions are derived. (Der-equivalence is followed

11In the model management tool we are implementing[SCE+07], such declaration would be a compile time ratherthan merge time error.

12If the sets were not disjoint, we would specify this in thecorrespondence span

from the fact that operations [gen] and [select] aremutually inverse).

We see that the situation described in cell (e) be-fore the merge is significantly underspecified. It maybe specified either by the correspondence span likein (d1’) (an address has two zips) or by the span(e1’) (an address has one zip that is either int orstr) or as in (e) (assumed by default in Vanilla).Note that for either-or semantics takes place, it isnot necessary that the sources of the attributes weredifferent. We can imagine a single address systembut so freely designed that some addresses have int-zips and others str-zips.Correspondingly, we have dif-ferent merged schemas. Neither of merges is rightor wrong: there are different semantics and respec-tively there are different merges adequate to them (cf.[MIR93]). Importantly, merges are different becausethe correspondence specifications for these semanticsare different, not because the operation of merge isnon-deterministic.

5.2 More on conflicts

Queries vs. constraints. Fig. 6(a) shows a sim-ple conflict resolved by a simple query. However, thiswould work only if the following two conditions hold.Firstly, we need to know that the class Person in theleft view is exactly the class right Person whose ageis less than 30. Secondly, the query language mustbe expressive enough to specify this knowledge in anadequate way. If one of this components is absent,but still we know that the left Person is a subclassof the right Person, we introduce into the correspon-dence schema the subclassing constraint. This con-straint presents a new piece of information capturedin neither of local schemas and neither of the projec-tion mappings is defined on it. In Vanilla’s terms,this constraint is a non-mapping element. There arealso situations when non-mapping elements are dataelements (nodes or arrows) rather than constraints.

Books and Authors continued. It is easy to seethat with query operations of navig or/and graphFig. 5 and function composition = , conflicts be-tween views depicted in Fig. 2 can be specified byequations and thus resolved.

The examples we considered show the flexibility

17

Page 18: Integrating schema integration frameworks, algebraicallyftp.cs.toronto.edu/pub/reports/csrg/583/TR-583-schemaIntegr.pdf · Integrating schema integration frameworks, algebraically∗

(a) query language is expressive enough

(a) query language is not expressive

f01

age Corr. schema S0

age

Person

int

f02 Young Person Person

int

age age f01

Sch. S1

/age

/Young Person

Sch. S2Corr. schema S0

<30 int

Person

ageint⟨select]

Person

int

f02

Figure 6: Queries vs. constraints

and convenience of the Query Discovery and Equat-ing (QDE) approach in the framework of dp-graphs.Moreover, based on the Universality statement (sec-tion 2.3), we can say that any overlapping betweenviews, which can be formally specified, can be speci-fied by QDE in a suitable signature of dp-graphs.

6 Conclusion

The goal of the paper was to analyze the problem ofschema integration in precise conceptual and techni-cal terms. To achieve it, we have applied a mathe-matically sound and powerful semantic model basedon graphs with diagram predicates (constraints) anddiagram operations (queries), dp-graphs in short.We have carefully analyzed major research workson schema integration, reformulated them in preciseterms of dp-graphs, removed inconsistencies and re-dundancies within and between the approaches, andintegrate them into, we believe, a coherent frame-work. Its main highlights are as follows.

Schema integration consists of three major phases:schema matching, merging and normalizing. Match-ing is the key to the entire problem because of theinfamous problem of specifying overlaping/conflictsbetween views in a manageable way. We have shownthat all types of conflicts identified in the litera-ture can be reduced to the following one universaltype: elements basic in one view are derived in an-other view and can be computed if the query lan-gauge is rich enough. Thus, specifying inter-schemacorrespondences is nothing but query discovery and

equating. This formulation gives us a universal yetcompact specification pattern, which schema match-ing must fill in, with a clear semantic meaning. Inits turn, a syntactically and semantically transpar-ent pattern for inter-schema correspondences givesrise to a syntactically and semantically clear algo-rithm for schema merging. Particularly, we give adeclarative syntactic definition of schema merging asa graph-based analog of the lattice-theoretic notionof the least upper bound. In a companion paper,we give a declarative semantic definition and showthat these two definitions are equivalent. Also, hav-ing only one type of conflicts between schemas allowsus to use a purely algebraic and in fact very simple al-gorithm for schema merging. However, since the cor-respondence specification involves derived elementsand queries against local schemas, the merged schemaalso contains derived elements and hence redundan-cies. It implies a post-merge phase (normalization)aimed at removing redundancies as much as possi-ble. Normalization can be non-trivial; particularly,we have presented an example showing that completeremoval of derived elements from the merged schemais not always possible.

To summarize, among the three main steps of inte-gration (match, merge, normalize), the first and thethird are heuristic and may be really complex (espe-cially match) while the very merge is pure algebraand simple. We suppose that excessive complexity ofmerge algorithms proposed in the literature is causedby partial absorbtion of underspecified overlappingand implicit normalization into the very merge pro-cedure. In contrast, in our approach the three phasesare explicitly separated and provided with clear spec-ification patterns ensuring their smooth sequentialcomposition. Importantly, clear separation of the en-tire problem into three well-shaped subproblems al-lows one to address each of them in a appropriateway and use the appropriate tools.

References

[AB01] S. Alagic and P. Bernstein. A model the-ory for generic schema management. InProc. DBPL’2001, 2001.

18

Page 19: Integrating schema integration frameworks, algebraicallyftp.cs.toronto.edu/pub/reports/csrg/583/TR-583-schemaIntegr.pdf · Integrating schema integration frameworks, algebraically∗

[AH88] S. Abiteboul and R. Hull. RestructuringHierarchical Database Objects. Theoret-ical Computer Science, 62:3–38, 1988.

[BC86] J. Biscup and B. Convent. A formal viewintegration method. In ACM SIGMODConf.on Managment of Data, pages 398–407, 1986.

[BDK92] P. Buneman, S. Davidson, and A. Kosky.Theoretical aspects of schema merging.In Advances in Database Technology -EDBT’92, Springer LNCS # 580, 1992.

[Ber03] P. Bernstein. Applying model manage-ment to classical metadata problems. InProc. CIDR’2003, pages 209–220, 2003.

[BLN86] C. Batini, M. Lenzerini, and S. Navathe.A comparitive analysis of methodologiesfor database schema integration. ACMComputing Surveys, 18(4):323–364, 1986.

[BW95] M. Barr and C. Wells. Category The-ory for Computing Science. Prentice Hall,1995.

[CD96] B. Cadish and Z. Diskin. Heteroge-nious view integration via sketches andequations. In Foundations of IntelligentSystems, 9th Int.Symposium, ISMIS ’96,Springer LNAI #1079, pages 603–612,1996.

[Con86] B. Convent. Unsolvable problems relatedto the view integration. In Int.Conf.onDatabase Theory, Roma, pages 141–156,1986.

[DH84] U. Dayal and H. Hwang. View definitionand generalization for database integra-tion of a multibase system. IEEE Trans.Software Eng., 10(6):628–644, 1984.

[Dis96] Z. Diskin. Databases as diagram al-gebras: Specifying queries and viewsvia the graph-based logic of skethes.Technical Report 9602, Frame In-form Systems, Riga, Latvia, 1996.

www.cs.toronto.edu/ zdiskin/Pubs/TR-9602.pdf.

[Dis05] Z. Diskin. Mathemtics of generic speci-fications for model management. In En-cyclopedia of Database Technologies andApplications, pages 351–366. Idea Group,2005.

[Dis06] Z. Diskin. Metamodel-independentschema & data integration: Towardsjoining syntax and semantics in genericmodel management. Technical Report2006-522, School of Computing, Queen’sUniversity, Kingston, Canada, 2006.http://www.cs.queensu.ca/TechReports/.

[DK03] Z. Diskin and B. Kadish. Variable setsemantics for keyed generalized sketches:Formal semantics for object identity andabstract syntax for conceptual modeling.Data & Knowledge Engineering, 47:1–59,2003.

[FKMP05] R. Fagin, P. Kolaitis, R. Miller, andL. Popa. Data exchange: semantics andquery answering. Theoretical ComputerScience, 336(1), 2005.

[FS90] P. Freyd and A. Scedrov. Categories, Al-legories. Elsevier Sciece Publishers, 1990.

[Len02] M. Lenzerini. Data integration: A theo-retical perspective. (Invited tutorial). In21st ACM Symposium on Principles ofdatabase systems, pages 233–246, 2002.

[LS86] J. Lambek and P. Scott. Introduction tohigher order categorical logic. CambridgeUniversity Press, 1986.

[LSDR07] Y. Lee, M. Sayyadian, A. Doan, andA. Rosenthal. eTuner: Tuning SchemaMatching Software Using Synthetic Sce-narios. The VLDB Journal, 16(1):97–122,2007.

19

Page 20: Integrating schema integration frameworks, algebraicallyftp.cs.toronto.edu/pub/reports/csrg/583/TR-583-schemaIntegr.pdf · Integrating schema integration frameworks, algebraically∗

[MBHR05] S. Melnik, P. Bernstein, A. Halevy, andE. Rahm. Supporting executable map-pings in model management. In Proc.SIGMOD’05. ACM Press, 2005.

[MBR01] J. Madhavan, P. Bernstein, and E. Rahm.Generic schema matching with Cupid. InProc. of the Int’l Conf. on Very LargeData Bases (VLDB), 2001.

[MHH00] R. J. Miller, L. M. Haas, andM. Hernandez. Schema Mapping asQuery Discovery. In Proc. VLDB, pages77–88, 2000.

[MIR93] R. Miller, Y. Ioannidis, and R. Ramakr-ishnan. The use of information capacityin schema integration and translation. InProc. Very Large Data Bases, pages 120–133, 1993.

[Mot87] A. Motro. Superviews: Virtual integra-tion of multiple databases. IEEE TOSE,13(7):785–798, 1987.

[PB03] R. Pottinger and P. Bernstein. Mergingmodels based on given correspondences.In Proc. of the Int’l Conf. on Very LargeData Bases (VLDB), 2003.

[PVM+02] L. Popa, Y. Velegrakis, R. J. Miller, M. A.Hernandez, and R. Fagin. TranslatingWeb Data. In Proc. VLDB, pages 598–609, 2002.

[RB01] E. Rahm and P. Bernstein. A survey ofapproaches to automatic schema match-ing. The VLDB Journal, 10(4):334–350,2001.

[SCE+07] Rick Salay, Marsha Chechik, Steve M.Easterbrook, Zinovy Diskin, Pete Mc-Cormick, Shiva Nejati, Mehrdad Sabet-zadeh, and Petcharat Viriyakattiyaporn.An eclipse-based tool framework for soft-ware model management. In ETX, pages55–59, 2007.

[SE05] M. Sabetzadeh and S. Easterbrook. Analgebraic framework for merging incom-plete and inconsistent views. In 13thInt.Conference on Requirement Engineer-ing, 2005.

[Shi81] D. Shipman. The functional data modeland the data language DAPLEX. ACMTODS, 6(1):140–173, 1981.

[SL90] A. Sneth and C. Larson. Federated data-base systems for managing distributed,heterogeneous, and autonomous data-bases. ACM Computing Surveys, 1990.

[SP92] S. Spaccapietra and C. Parent. View in-tegration: a step forward in solving struc-tural conflicts. IEEE Transactions onKDE, 1992.

[SPD92] S. Spaccapietra, C. Parent, andY. Dupont. Model-independent as-sertions for integration of heterogeneousschemas. The VLDB Journal, 1(1), 1992.

20

Page 21: Integrating schema integration frameworks, algebraicallyftp.cs.toronto.edu/pub/reports/csrg/583/TR-583-schemaIntegr.pdf · Integrating schema integration frameworks, algebraically∗

Taxonomy of conflicts in Vanilla [PB03]

… and their algebraic query-based representation (Schema matching as Query Discovery)

(a) Representation (R)-conflict of type I: resolved by equality mapping elements

(EMEs)

(a’) Properties of the Corr-span: (i) Corr-head (schema S0 ) does not contain derived elements, (ii) Corr. legs f01, f02 are totally defined functions

(b) R-conflict of type II: resolved by EMEs and sub-element relationship

(b’) Properties of the Corr-span: (i) Corr-head does contain derived elements (produced by query q1 = q11° q12) (ii) Corr. legs f01, f02 are partially defined functions but (iii) For each basic element of the head, either f01 or f02 is defined.

(c) R-conflict of type III: resolved by

SMEs and Expression property (c’) The same as b’. The only difference is that a different query is used to match schemas

(d1’) match

(d) R-conflict: Matching schemas with a

non-mapping element allBio (d2’) merge with some derived info added

(pairing arrows o-bio and n-bio) (d3’) restructuring the merge to a der-

equivalent form

(e1’) match

(e) Fundamental (F)-conflict (e2’) merge with some derived info added (e3’) restructuring the merge to a der-

equivalent form

/zip2

/USAddr

zip Addr

int zip

Addr

str

Addr

int str

int ∪ str

zip

/CaAddr

/zip1 [select1⟩⟨select2]

zip zip

Addr

int zip

Addr

str

CaAddr

int zip

str

USAddr f01 f02Zip-code

int

Zip-code

str

int and str

=

[merge]

Zip-code

int str zip

USAddr CaAddr

/Addr

[gen]

int

zip str

int ∪ str

[gen] /zip

v2m’

/n-bio /o-biol allBio

bio txt

bio Person

txt

txt × txt

[ = ⟩⟨ = ]

txt

Person

txt

v ’ 1m

Person

Person

bio

Person

bio allBio

= Official Bio

= Non-official

Bio

=

n-bio o-bio bio txt

Person bio

Person Person f01f02

txt txt txt

n-bio o-bio Person

txt txt /allBio

txt × txt

f-name l-name

Person

str ⟨q2 (concat)]/name name str

Person

l-name

f-name str

Person

f01

f02

Person

name

Empl

f-name =

sim l-name name =

concat(f-name,l-name)

f01l-name

/2-proj

Person

name

sepStr

[q11⟩ /name /1-proj

/f-name /l-name name

Person f-name

str

Person

f02

[q12⟩

str str str

str

Person

name

Empl

f-name =

= l-name

= =

f01 Sch. S1

pin int

Person idn

Employee

eid

Empl

Sch. S2 Corr. schema S0

f02int int

Person

pin

Empl

eid

Sch. S2 Sch. S1

= =

Map M12

v2m v1m

Table 2: Conflicts via query discovery21