Data Exchange: Semantics and Query Answering

Ronald Fagin -- IBM Almaden Research Center
Phokion G. Kolaitis -- UC Santa Cruz
Renee J. Miller -- University of Toronto
Lucian Popa -- IBM Almaden Research Center

IBM Almaden - November 12, 2002
(To appear in ICDT 2003)



Motivation and Overview

Data exchange problem:
– How to restructure data from a source schema to a target schema, according to a given specification

Main motivation for this work:
– Understanding the fundamental issues that underlie data exchange systems such as EXPRESS and Clio

Main challenge:
– Inherent under-specification:
  • The specification (as we shall see) must be simple and intuitive, but
  • There are many ways in which the restructuring can be performed!
– Question: did we make the right choice in the design of Clio?

Our approach (only the relational case, so far):
– Define and study universal solutions
  • Show that this is the “best” way of performing data exchange
  • Study computational aspects
  • Study what happens after data exchange: query answering

The Data Exchange Problem

Assume a data exchange setting:
• a source schema S,
• a target schema T with a set $\Sigma_t$ of target dependencies (see next),
• a set $\Sigma_{st}$ of source-to-target dependencies (see next).

The data exchange problem is the following:
Input:
• a source instance I
Output:
• a target instance J such that $\langle I, J \rangle \models \Sigma_{st}$ and $J \models \Sigma_t$ (call such a J a solution for I)

[Figure: source schema S (with instance I) is related to target schema T (with instance J) by $\Sigma_{st}$; $\Sigma_t$ constrains the target.]

Source-to-target Dependencies

For most practical purposes, $\Sigma_{st}$ contains:
– source-to-target tuple-generating dependencies (tgds): $\phi_S(\mathbf{x}) \rightarrow \exists \mathbf{y}\, \psi_T(\mathbf{x}, \mathbf{y})$
  e.g. $\text{DeptEmp}(did, mgr\_name, eid) \rightarrow \exists M.\ \text{Dept}(did, M, mgr\_name) \wedge \text{Emp}(eid, did)$
  (Move data from the source table DeptEmp into two target tables, Dept and Emp. The existential variable M is an “unspecified” manager id.)

Source schema: DeptEmp(did, mgr_name, eid)
Target schema: Dept(did, mgr_id, mgr_name), Emp(eid, did)
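
As a rough illustration of what the DeptEmp tgd asks for, the sketch below applies it to a list of source rows and invents one fresh null per row for the existential variable M. The `Null` class and the function name `apply_deptemp_tgd` are assumptions made for this example and do not come from the slides or from Clio.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Null:
    """A labeled null: an unknown value invented during data exchange."""
    name: str


def apply_deptemp_tgd(dept_emp_rows):
    """Apply DeptEmp(did, mgr_name, eid) -> exists M. Dept(did, M, mgr_name) AND Emp(eid, did).

    Returns the pair (Dept, Emp) of target relations as sets of tuples.
    """
    fresh = count()
    dept, emp = set(), set()
    for did, mgr_name, eid in dept_emp_rows:
        manager_id = Null(f"M{next(fresh)}")  # the "unspecified" manager id
        dept.add((did, manager_id, mgr_name))
        emp.add((eid, did))
    return dept, emp


dept, emp = apply_deptemp_tgd([("CS", "Mary", "E003")])
print(dept, emp)
```

The null stands for a manager id that the source does not record; any concrete solution is free to replace it with an actual value.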

Target Dependencies

The second, equally important, part of the specification is the set of target dependencies $\Sigma_t$:
– tgds: $\phi_T(\mathbf{x}) \rightarrow \exists \mathbf{y}\, \psi_T(\mathbf{x}, \mathbf{y})$
  e.g. $\text{Dept}(did, mgr\_id, mgr\_name) \rightarrow \exists D.\ \text{Emp}(mgr\_id, D)$
  (A foreign key constraint in the target)
– equality-generating dependencies (egds): $\phi_T(\mathbf{x}) \rightarrow (x_1 = x_2)$
  e.g. $\text{Emp}(e, d_1) \wedge \text{Emp}(e, d_2) \rightarrow (d_1 = d_2)$
  (A target key constraint)

Target schema: Dept(did, mgr_id, mgr_name), Emp(eid, did)
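
The two example target dependencies can be checked against a candidate target instance with a few lines of code. This is only a sketch for these two specific constraints; the function names and the `Null` class are assumptions introduced for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Null:
    """A labeled null (unknown value)."""
    name: str


def satisfies_foreign_key(dept, emp):
    """tgd: Dept(did, mgr_id, mgr_name) -> exists D. Emp(mgr_id, D)."""
    employee_ids = {eid for eid, _ in emp}
    return all(mgr_id in employee_ids for _, mgr_id, _ in dept)


def satisfies_key(emp):
    """egd: Emp(e, d1) AND Emp(e, d2) -> d1 = d2."""
    department_of = {}
    for eid, did in emp:
        if department_of.setdefault(eid, did) != did:
            return False
    return True


dept = {("CS", Null("M0"), "Mary")}
emp = {("E003", "CS")}
print(satisfies_foreign_key(dept, emp))  # the manager M0 has no Emp tuple yet
print(satisfies_foreign_key(dept, emp | {(Null("M0"), Null("D0"))}))
print(satisfies_key({("E003", "CS"), ("E003", "EE")}))
```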

Questions (To Be Answered Next)

• When more than one solution exists, how do we choose a “best” one?
• How do we compute a “best” solution?
• Is there always a solution? Is there always a “best” solution?
• How does query answering on the chosen solution behave?

Universal Solutions = “Best” Solutions

Existence of Multiple Solutions

There may be many solutions, that is, many possible target instances (J, J1, J2, etc.), for the same source instance. However, J seems to be more general:
– there exist homomorphisms h1: J -> J1 and h2: J -> J2 (see definition next)
– but none from J1 or J2 to J
– intuitively, J1 and J2 have extra information

Source instance (relations P, Q, R, each with attributes A, B, C):
  P: <a0, b’0, c’0>
  Q: <a’’0, b0, c’’0>
  R: <a’’’0, b’’’0, c0>

Source-to-target dependencies (the target has a single relation T with attributes A, B, C):
  $P(a,b,c) \rightarrow \exists Y \exists Z.\ T(a,Y,Z)$
  $Q(a,b,c) \rightarrow \exists X \exists U.\ T(X,b,U)$
  $R(a,b,c) \rightarrow \exists V \exists W.\ T(V,W,c)$

Solutions:
  J:  T = { <a0, Y0, Z0>, <X0, b0, U0>, <V0, W0, c0> }
  J1: T = { <a0, b0, c0> }
  J2: T = { <a0, b0, Z1>, <V1, W1, c0> }

h1 = {Y0 -> b0, Z0 -> c0, … }

X0, Y0, Z0, … represent “unknown” values (or “nulls”).

Homomorphisms

As we have seen, the values of a target instance can be either:
• constants (i.e. values coming from the source instance), or
• nulls (unknown values)

Definition. Assume J1 and J2 are such target instances. A homomorphism h: J1 -> J2 is a mapping from the values of J1 to the values of J2 such that:
– h(c) = c, for every constant c (nulls of J1 can be mapped to any values of J2)
– for every tuple <a1, …, an> in relation T of instance J1, <h(a1), …, h(an)> must be a tuple in relation T of instance J2

Example:
  J:  T = { <a0, Y0, Z0>, <X0, b0, U0>, <V0, W0, c0> }
  J1: T = { <a0, b0, c0> }
  h1 = {Y0 -> b0, Z0 -> c0, … }
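
The definition translates directly into a small backtracking search. The sketch below is a brute-force illustration (exponential in the worst case), not an algorithm from the slides; the fact encoding `(relation, values)`, the `Null` class, and the names `unify` and `find_homomorphism` are assumptions made for the example. It uses the instances J, J1, and J2 of the running example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Null:
    """A labeled null (unknown value); plain strings are constants."""
    name: str


def unify(values, target, h):
    """Extend mapping h so that `values` is sent onto `target`, or return None."""
    h = dict(h)
    for v, w in zip(values, target):
        if isinstance(v, Null):
            if h.setdefault(v, w) != w:  # a null must keep a single image
                return None
        elif v != w:                     # constants are mapped to themselves
            return None
    return h


def find_homomorphism(j_from, j_to):
    """Return a dict from the nulls of j_from to values of j_to, or None."""
    facts = sorted(j_from, key=repr)

    def extend(i, h):
        if i == len(facts):
            return h
        rel, values = facts[i]
        for rel2, target in j_to:
            if rel2 == rel and len(target) == len(values):
                h2 = unify(values, target, h)
                if h2 is not None:
                    result = extend(i + 1, h2)
                    if result is not None:
                        return result
        return None

    return extend(0, {})


X0, Y0, Z0, U0, V0, W0 = (Null(n) for n in ("X0", "Y0", "Z0", "U0", "V0", "W0"))
J = {("T", ("a0", Y0, Z0)), ("T", (X0, "b0", U0)), ("T", (V0, W0, "c0"))}
J1 = {("T", ("a0", "b0", "c0"))}
J2 = {("T", ("a0", "b0", Null("Z1"))), ("T", (Null("V1"), Null("W1"), "c0"))}

print(find_homomorphism(J, J1))   # a mapping that includes Y0 -> b0 and Z0 -> c0
print(find_homomorphism(J1, J))   # None: J1 cannot be mapped back into J
```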

Universal Solution

Definition. Assume a data exchange setting (S, T, $\Sigma_{st}$, $\Sigma_t$). Given a source instance I, a universal solution for I is a target instance J such that:
(1) J is a solution for I
(2) for every solution J’ for I, there exists a homomorphism h: J -> J’

For the previous example, J is a universal solution; J1 and J2 are not.

Among all solutions, universal solutions are special:
• They contain no more and no less than the amount of information given by the specification

  J: T = { <a0, Y0, Z0>, <X0, b0, U0>, <V0, W0, c0> }

Fact:
– Uniqueness up to homomorphic equivalence:
  • If J1 and J2 are universal solutions for I, then there are homomorphisms between J1 and J2 in both directions
– Representation of the space of solutions:
  • If J1 and J2 are universal solutions for I1 and I2 respectively, then Sol(I1) = Sol(I2) iff J1 and J2 are homomorphically equivalent (Sol(I) denotes the set of all solutions for I)

We adopt the universal solution as the notion of “best” solution.

Later we will see another justification for universal solutions in terms of query answering.

When do universal solutions exist?

How do we compute a universal solution?


Computing Universal Solutions

Chase

We canonically generate a universal solution by using the chase:
– Given a source instance I, start with an empty target instance J
– Generate tuples in J by applying the dependencies in $\Sigma_{st}$ and $\Sigma_t$

Example:

Source schema: DeptEmp(did, mgr_name, eid)
Target schema: Dept(did, mgr_id, mgr_name), Emp(eid, did)

I: DeptEmp = { <CS, Mary, E003> }

$\Sigma_{st}$: $\text{DeptEmp}(did, mgr\_name, eid) \rightarrow \exists M.\ \text{Dept}(did, M, mgr\_name) \wedge \text{Emp}(eid, did)$
$\Sigma_t$: $\text{Dept}(did, mgr\_id, mgr\_name) \rightarrow \exists D.\ \text{Emp}(mgr\_id, D)$

J:
  Dept: <CS, M0, Mary>   (added in the first chase step; M0 is a null)
  Emp:  <E003, CS>       (added in the first chase step)
  Emp:  <M0, D0>         (added in the second chase step)
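
The two chase steps of this example can be reproduced with a small tgd-only chase. This is a minimal sketch under simplifying assumptions: source and target facts live in one set of `(relation, values)` pairs, atoms contain only variables, egds are not handled, and a `max_steps` bound stands in for a real termination argument. All names (`Null`, `matches`, `chase`) are made up for the illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Null:
    """A labeled null (unknown value); plain strings are constants."""
    name: str


def matches(atoms, facts, binding):
    """Yield every extension of `binding` that maps all atoms onto facts."""
    if not atoms:
        yield binding
        return
    (rel, variables), rest = atoms[0], atoms[1:]
    for fact_rel, values in facts:
        if fact_rel != rel or len(values) != len(variables):
            continue
        new = dict(binding)
        if all(new.setdefault(var, val) == val for var, val in zip(variables, values)):
            yield from matches(rest, facts, new)


def chase(facts, tgds, max_steps=1000):
    """Repeatedly repair one violated tgd by adding head facts with fresh nulls."""
    facts = set(facts)
    counters = Counter()
    for _step in range(max_steps):
        violation = next(
            ((head, h) for body, head in tgds
             for h in list(matches(body, facts, {}))
             if next(matches(head, facts, h), None) is None),
            None,
        )
        if violation is None:
            return facts  # every tgd is satisfied
        head, h = violation
        h = dict(h)
        for _rel, variables in head:
            for var in variables:
                if var not in h:  # existential variable: invent a fresh null
                    h[var] = Null(f"{var}{counters[var]}")
                    counters[var] += 1
        facts |= {(rel, tuple(h[v] for v in variables)) for rel, variables in head}
    raise RuntimeError("chase did not terminate within max_steps")


st_tgd = ([("DeptEmp", ("did", "mgr_name", "eid"))],
          [("Dept", ("did", "M", "mgr_name")), ("Emp", ("eid", "did"))])
t_tgd = ([("Dept", ("did", "mgr_id", "mgr_name"))],
         [("Emp", ("mgr_id", "D"))])

result = chase({("DeptEmp", ("CS", "Mary", "E003"))}, [st_tgd, t_tgd])
target = {fact for fact in result if fact[0] != "DeptEmp"}
print(target)
```

Under these assumptions the target part of the result consists of the three tuples shown in the example: one Dept tuple with the null M0 and two Emp tuples.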

This process is repeatedly applied:
– for all the source tuples and for the generated tuples,
– as long as there are dependencies that are not yet satisfied

The chase may be infinite (cyclic $\Sigma_t$) …
… or it may fail (e.g. target key constraints that cannot be satisfied for the given source data)
– (details in the paper)

However, if the chase successfully terminates, the resulting target instance is a solution.
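
To make the failure case concrete, the sketch below chases a single relation Emp with the key egd from the earlier slide. Equating a null with another value replaces the null; equating two distinct constants is impossible, so the chase fails. The names `ChaseFailure` and `chase_key_egd` are assumptions for this example, and a full chase would apply the replacement across the whole target instance rather than one relation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Null:
    """A labeled null (unknown value); plain strings are constants."""
    name: str


class ChaseFailure(Exception):
    """Raised when an egd would have to equate two distinct constants."""


def chase_key_egd(emp):
    """Chase Emp with the egd Emp(e, d1) AND Emp(e, d2) -> d1 = d2."""
    emp = set(emp)
    while True:
        clash = next(((d1, d2) for e1, d1 in emp for e2, d2 in emp
                      if e1 == e2 and d1 != d2), None)
        if clash is None:
            return emp
        d1, d2 = clash
        if not isinstance(d1, Null) and not isinstance(d2, Null):
            raise ChaseFailure(f"cannot equate constants {d1!r} and {d2!r}")
        # Replace a null by the other value everywhere it occurs.
        old, new = (d1, d2) if isinstance(d1, Null) else (d2, d1)
        emp = {tuple(new if v == old else v for v in row) for row in emp}


print(chase_key_egd({("E003", Null("D0")), ("E003", "CS")}))
try:
    chase_key_egd({("E003", "CS"), ("E003", "EE")})
except ChaseFailure as err:
    print("chase failed:", err)
```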

Canonical Generation of Universal Solutions

Theorem. Assume a data exchange setting (S, T, $\Sigma_{st}$, $\Sigma_t$). Given a source instance I:
– If the chase is finite and successful then its result is a universal solution.
– If the chase fails then there is no solution.

Thus, the chase is a procedure for computing universal solutions, provided that:
– Solutions exist, and
– The chase is finite

We call universal solutions computed by the chase canonical universal solutions.

When can we guarantee that the chase is finite?

Weakly Acyclic Sets of Dependencies

Some cyclic sets of dependencies may cause infinite chases
– In such a case no universal solution may exist, and the semantics of the data exchange is undefined

Still, there are cyclic sets of dependencies that behave well and are quite useful

Weakly acyclic sets of dependencies (defined in the paper):
– Cover many practical cases of target constraints
– Allow for restricted cyclicity
– The chase is guaranteed to be finite
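
The slides defer the definition of weak acyclicity to the paper. The sketch below assumes the dependency-graph formulation given there: nodes are positions (relation, attribute index); a tgd adds a regular edge from each body position of a variable x to each head position where x reappears, and a special edge from that body position to each head position holding an existential variable; the set is weakly acyclic if no cycle passes through a special edge. The tgd encoding and the function name are assumptions made for the illustration.

```python
def is_weakly_acyclic(tgds):
    """tgds: list of (body_atoms, head_atoms); an atom is (relation, variable_names)."""
    edges = {}       # position -> set of successor positions
    special = set()  # edges that create a new (existential) value
    for body, head in tgds:
        body_vars = {v for _, variables in body for v in variables}
        head_positions = [((rel, i), v)
                          for rel, variables in head
                          for i, v in enumerate(variables)]
        head_vars = {v for _, v in head_positions}
        for rel, variables in body:
            for i, x in enumerate(variables):
                if x not in head_vars:
                    continue  # only variables that are propagated to the head
                src = (rel, i)
                for pos, v in head_positions:
                    if v == x:
                        edges.setdefault(src, set()).add(pos)
                    elif v not in body_vars:
                        edges.setdefault(src, set()).add(pos)
                        special.add((src, pos))

    def reachable(start, goal):
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(edges.get(node, ()))
        return False

    # A special edge src -> dst lies on a cycle iff src is reachable from dst.
    return not any(reachable(dst, src) for src, dst in special)


foreign_key = ([("Dept", ("did", "mgr_id", "mgr_name"))], [("Emp", ("mgr_id", "D"))])
back_edge = ([("Emp", ("eid", "did"))], [("Dept", ("did", "M", "N"))])  # hypothetical extra tgd

print(is_weakly_acyclic([foreign_key]))
print(is_weakly_acyclic([foreign_key, back_edge]))
```

The second call adds a hypothetical constraint that every department referenced by an employee must exist in Dept. Together with the foreign key it keeps inventing new managers and departments, which is the kind of cycle through a special edge that the check rejects.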

Polynomial-Time Chase

Theorem. Let $\Sigma$ be a weakly acyclic set of dependencies. For every instance K, the chase of K with $\Sigma$ can be computed in polynomial time.

Corollary. Assume a data exchange setting (S, T, $\Sigma_{st}$, $\Sigma_t$) such that $\Sigma_t$ is a weakly acyclic set of dependencies.
– For every source instance I, the existence of a solution can be checked in polynomial time.
– For every source instance I, if a solution exists then a universal solution can be produced in polynomial time.

Next: what happens after data exchange?

In particular, how is subsequent query answering affected by our choice of a solution (universal solution)?


Query Answering

Query Evaluation on a Solution

Assume a fixed data exchange setting with a source instance I.
Suppose that System 1 chooses a solution J for data exchange.

A query q can now be asked against the target.
– The evaluation of q, in System 1, is q(J).
– However, a System 2 materializing a different solution J’ may give a different evaluation q(J’).

Different choices of J (for the same I) imply possibly different query evaluations.

Is there a notion of the “right” set of answers to q with respect to I?

[Figure: the data exchange setting (source schema S with instance I, target schema T with instance J, $\Sigma_{st}$, $\Sigma_t$), with the query q posed against the target.]

Certain Answers

We will use a notion that has been around in the context of data integration and incomplete databases, where queries are asked against a set of possible databases.

Definition. Given I and q, a tuple t is a certain answer if $t \in q(J)$ for every solution J.
Notation: certain(q, I) = the set of all certain answers

Thus, t is certain if it is in the answer of q on every solution.

The certain answers provide well-defined semantics to query answering because they are independent of the choice of a solution.

Can we compute the certain answers based just on our chosen (universal) solution?

Positive Queries

Proposition. Assume a data exchange setting (S, T, $\Sigma_{st}$, $\Sigma_t$) and a source instance I.
1. Let q be a positive query. If J is a universal solution, then certain(q, I) = q(J).
2. Let J be a solution such that for every positive query q we have that certain(q, I) = q(J). Then J is a universal solution.

Thus, the certain answers of positive queries can be computed by evaluating them on any universal solution.

Moreover, this property characterizes universal solutions.

Note: In the above:
– Positive query means union of SPJ queries
– q(J) means: evaluate q on J and then throw away the tuples that contain nulls
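
The note on q(J) can be made concrete with a tiny conjunctive-query evaluator that drops answers containing nulls. This is a sketch under simplifying assumptions: query atoms contain only variables, the instance is a set of `(relation, values)` facts, and a union of SPJ queries would simply be the union of the answer sets of its disjuncts. The names `Null`, `matches`, and `evaluate` are made up for the illustration; the instance J is the chase result of the DeptEmp example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Null:
    """A labeled null (unknown value); plain strings are constants."""
    name: str


def matches(atoms, facts, binding):
    """Yield every extension of `binding` that maps all atoms onto facts."""
    if not atoms:
        yield binding
        return
    (rel, variables), rest = atoms[0], atoms[1:]
    for fact_rel, values in facts:
        if fact_rel != rel or len(values) != len(variables):
            continue
        new = dict(binding)
        if all(new.setdefault(var, val) == val for var, val in zip(variables, values)):
            yield from matches(rest, facts, new)


def evaluate(head_vars, body, facts):
    """Evaluate the conjunctive query head_vars :- body and drop tuples with nulls."""
    answers = {tuple(h[v] for v in head_vars) for h in matches(body, facts, {})}
    return {t for t in answers if not any(isinstance(v, Null) for v in t)}


M0, D0 = Null("M0"), Null("D0")
J = {("Dept", ("CS", M0, "Mary")), ("Emp", ("E003", "CS")), ("Emp", (M0, D0))}

# q1(e, d) :- Emp(e, d)
print(evaluate(("e", "d"), [("Emp", ("e", "d"))], J))
# q2(n, e) :- Dept(d, m, n), Emp(e, d)
print(evaluate(("n", "e"), [("Dept", ("d", "m", "n")), ("Emp", ("e", "d"))], J))
```

For q1, the tuple <M0, D0> is produced by the evaluation but discarded because it contains nulls, so only the null-free Emp tuple remains.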

Conjunctive Queries with Inequalities

The situation changes when negation is involved (even in the very simple form of $\neq$).

Example:

Source instance: R(A, B) = { <a0, b0> }, S(A, B) = { <a1, a0> }
Target schema: T(A, B)

d1: $R(a,b) \rightarrow \exists X.\ T(a,X)$
d2: $S(a,b) \rightarrow \exists Z.\ T(Z,b)$

J (universal): T = { <a0, X0>, <Z0, a0> }
J2 (not universal): T = { <a0, a0> }

$q(u, v) \ \text{:-}\ \exists x \exists z.\ T(u, x) \wedge T(z, v) \wedge x \neq z$

It can be verified that:
– <a0,a0> $\in$ q(J), but <a0,a0> $\notin$ q(J2) (thus, not a certain answer).

Hence certain(q, I) $\subsetneq$ q(J).

The universal solution gives extra answers.
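
The claim about the example with relations R, S, and T on this slide can be checked mechanically. The sketch below hard-codes the query q and evaluates it naively, treating distinct nulls as distinct values and discarding answers that contain nulls; the `Null` class and the function name `q` are assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Null:
    """A labeled null (unknown value); plain strings are constants."""
    name: str


def q(t):
    """q(u, v) :- exists x, z. T(u, x) AND T(z, v) AND x != z (null-free answers only)."""
    answers = {(u, v) for u, x in t for z, v in t if x != z}
    return {a for a in answers if not any(isinstance(val, Null) for val in a)}


J = {("a0", Null("X0")), (Null("Z0"), "a0")}   # universal solution
J2 = {("a0", "a0")}                            # a solution that is not universal

print(q(J))    # contains ("a0", "a0")
print(q(J2))   # empty, so ("a0", "a0") is not a certain answer
```

On J the inequality holds only because X0 and Z0 are different nulls; J2 collapses both to a0, and the answer disappears.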

For conjunctive queries with inequalities, we have seen that simple query evaluation on a universal solution is not enough for computing the certain answers.

Question: Can we find a different SQL query q* that, when evaluated on a universal solution, gives the set of certain answers of q?

There are examples for which such a query q* exists.

However, we show next that the answer is “no”, in general.

Complexity: Two or More Inequalities

Theorem. Computing the certain answers of unions of conjunctive queries with at most two inequalities per term is coNP-hard, even in a restricted data exchange setting (LAV).

[AD98] proved a similar result for the case of conjunctive queries with six or more inequalities.

The coNP-hardness implies:
– the certain answers cannot be computed by evaluating the query q (or any other SQL query q*) on a polynomial-time generated universal solution (unless P = NP).

Complexity: One Inequality

Theorem. Assume a data exchange setting (S, T, $\Sigma_{st}$, $\Sigma_t$) such that $\Sigma_t$ is a weakly acyclic set of dependencies. Let q be a union of conjunctive queries with at most one inequality per term. Let I be a source instance and let J be an arbitrary universal solution for I. Then there exists a polynomial-time algorithm with input J that computes certain(q, I).

Thus, computing the certain answers for such queries is a tractable problem.
– Moreover, this computation can take place on any universal solution.
– The universal solution has all the information needed to compute the certain answers.

We show next that the problem of computing the certain answers, even for this tractable case, cannot be solved by means of SQL query evaluation.

First-Order Inexpressibility

Theorem. There exists a data exchange setting and a boolean conjunctive query q with one inequality, for which there is no first-order query q* over the canonical universal solution J such that certain(q, I) = q*(J).

This is a strong inexpressibility result that shows that in data exchange we cannot use the notion of certain answers for answering queries with inequalities.

(In practice, instead of going for certain answers, we should just use query evaluation on the universal solution)

The proof uses an original combination of finite model theory techniques and the chase.

Conclusions

• Universal solutions are a good candidate for use in data exchange
• Clio produces such a universal solution (in the relational case)
• All universal solutions are equally good for answering positive queries.
  – Simple query evaluation has the same semantics as that of the certain answers.
• For queries with inequalities, different universal solutions may give different query evaluations, which may in turn differ from the certain answers.
  – There is no hope of finding the certain answers by means of SQL query evaluation on a universal solution


Future Work

Among all universal solutions, is there a universal solution that approximates, in a best way, the certain answers ? If yes, can this be computed efficiently ?

Extension to semantics and query answering for data exchange in the nested (XML) case.

Relationship to Clio

Source-to-target tgds are the same formalism that Clio uses internally (for the relational case):
– $\Sigma_{st}$ is generated by Clio in the semantic translation phase [VLDB02] from correspondences (user input)
– Then Clio generates, based on $\Sigma_{st}$, a set of queries in the data translation phase [VLDB02]
– These queries compute a solution
– Is Clio’s solution a good one? (Since other solutions are also possible)

Here we try to understand, formally, the concept of “good” solutions

$\Sigma_t$ is more general than the target constraints that Clio can currently handle.