UNIVERSITY of PENNSYLVANIA Grigoris Karvounarakis June 05
Answering queries across mappings
Grigoris Karvounarakis
University of Pennsylvania
WPE-II Presentation
Data integration
[Figure: heterogeneous data sources S1, S2, …, Sn with instances I1, I2, …, In are related through mappings M1, M2, …, Mn to a virtual global mediated schema T, over which a query Q is posed.]
Data exchange
[Figure: a source schema S with instance I is related through mapping M to a target schema T with instance J.]
J is a data exchange solution if: ⟨I,J⟩ ⊨ M and J ⊨ T
Query answering (basic problem setting)
[Figure: source schema S with instance I, mapping M, target schema T; the query Q is posed over the target.]
Given source and target schemas (S, T), a mapping M, source instance(s) I and a query QT (over the target), evaluate QT (using data from I).
Query reformulation: compute a reformulation Q’ of QT that only refers to source relations.
Data exchange: compute a data exchange solution J, such that QT can be evaluated directly on J.
Outline
Preliminaries
  Mapping languages
  Semantics of query answering
Query reformulation
Query answering using data exchange
Comparison
Mapping languages
Two approaches:
Containment between conjunctive queries
Dependencies (logical assertions)
Query containment
Definition: A query Q1 is contained in a query Q2, denoted by Q1 ⊑ Q2, if for all database instances I: Q1(I) ⊆ Q2(I).
Two queries Q1 and Q2 are equivalent if Q1 ⊑ Q2 and Q2 ⊑ Q1.
In the case where Q1 and Q2 are over different schemas, related through a mapping M:
M ⊨ Q1 ⊑ Q2 if, for all I, J such that ⟨I,J⟩ ⊨ M: Q1(I) ⊆ Q2(J)
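For two conjunctive queries over the same schema, containment can be tested with the classical homomorphism criterion: Q1 ⊑ Q2 holds exactly when some mapping of the variables of Q2 to the variables of Q1 sends the head of Q2 to the head of Q1 and every body atom of Q2 to a body atom of Q1. The following is a minimal brute-force sketch of that test, assuming queries without constants; the function name `contained_in`, the tuple encoding of queries, and the two sample queries are illustrative choices, not part of the slides.

```python
from itertools import product

def contained_in(q1, q2):
    """Return True if q1 is contained in q2; a query is (head_vars, [(relation, args), ...])."""
    head1, body1 = q1
    head2, body2 = q2
    if len(head1) != len(head2):
        return False
    # Candidate homomorphisms map every variable of q2 to some variable of q1.
    vars2 = sorted(set(head2) | {v for _, args in body2 for v in args})
    vars1 = sorted(set(head1) | {v for _, args in body1 for v in args})
    atoms1 = {(rel, tuple(args)) for rel, args in body1}
    for image in product(vars1, repeat=len(vars2)):
        h = dict(zip(vars2, image))
        head_ok = tuple(h[v] for v in head2) == tuple(head1)
        body_ok = all((rel, tuple(h[v] for v in args)) in atoms1
                      for rel, args in body2)
        if head_ok and body_ok:
            return True
    return False

# Illustrative queries (assumed):  Q1(x) :- T(x,y), T(y,x)   and   Q2(x) :- T(x,y)
q1 = (("x",), [("T", ("x", "y")), ("T", ("y", "x"))])
q2 = (("x",), [("T", ("x", "y"))])
```

Here `contained_in(q1, q2)` holds while `contained_in(q2, q1)` does not, since no variable mapping sends T(y,x) into the single atom of Q2. The search is exponential in the number of variables, which is acceptable only for small examples.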
Containment mappings
General form (GLAV):
QS(x,y) ⊑ QT(x,z) (sound, Open World Assumption)
QS(x,y) ≡ QT(x,z) (exact, Closed World Assumption)
QS, QT are conjunctions of relational atoms over S, T respectively.
Special cases:
GAV (global-as-view): the target is specified as a view of the source(s)
QS(x,y) ⊑ T(x) (sound, OWA)
QS(x,y) ≡ T(x) (exact, CWA)
LAV (local-as-view): the sources are specified as views of the virtual mediated schema
S(x) ⊑ QT(x,y) (sound, OWA)
S(x) ≡ QT(x,y) (exact, CWA)
Dependencies
Tuple-generating dependencies (tgds): ∀x,z φ(x,z) → ∃y ψ(x,y)
(where φ, ψ are conjunctions of relational atoms and x, y, z are vectors of variables)
Equality-generating dependencies (egds): ∀x φ(x) → xi = xj
Data exchange schema mappings
Source-to-target tgds: ∀x,z φ(x,z) → ∃y ψ(x,y)
φ is a conjunction of atoms over S and ψ is a conjunction of atoms over T
Target tgds: both φ, ψ are conjunctions of atoms over T
Target egds: ∀x φ(x) → xi = xj
φ is a conjunction of atoms over T
Containment mappings vs. source-to-target tgds
A source-to-target tgd of the form: ∀x,z QS(x,z) → ∃y QT(x,y)
is equivalent to the sound GLAV mapping: QS(x,z) ⊑ QT(x,y)
Sound GAV and LAV mappings can also be expressed by source-to-target tgds.
But exact mappings also include a target-to-source direction. E.g.: S(x,z) ≡ T1(x,y), T2(y,z) is equivalent to:
∀x,z S(x,z) → ∃y T1(x,y) ∧ T2(y,z) (source-to-target) and
∀x,y,z T1(x,y) ∧ T2(y,z) → S(x,z) (target-to-source)
Incompleteness
Mappings do not specify the target instance completely.
E.g.: ∀x,z S(x,z) → ∃y T(x,y) ∧ T(y,z) does not specify the values of y.
[Figure: a source instance I of S corresponds, through mapping M, to many possible target instances J1, J2, J3, … of T.]
E.g., if I = {S(a,b)}:
J1 = {T(a,a), T(a,b)}
J2 = {T(a,b), T(b,b)}
J3 = {T(a,X), T(X,b)}
J4 = {T(a,X), T(X,b), T(a,Y), T(Y,b)}
. . .
Semantics of query answering
What do we expect as answers to queries over the target schema?
“Possible worlds” semantics: for every instance I of S, consider all possible instances J of the target schema T such that ⟨I,J⟩ ⊨ M.
Convention: certain answers
certainM,I(QT) = ⋂ { QT(J) : ⟨I,J⟩ ⊨ M }
Equivalent reformulation
Definition: Q’S is an equivalent reformulation of QT across M (denoted M ⊨ QT ≡ Q’S) if, for every pair of instances I, J of S, T s.t. ⟨I,J⟩ ⊨ M: Q’S(I) = QT(J)
Equivalent reformulations may not exist
Schemas: S(c), T(a,b)
Mapping: ∀x S(x) ↔ T(x,x)
Query: Q(x) :- T(x,y)
Any reformulation over S can only return values v such that T(v,v).
But there are instances J s.t. T contains tuples in which a ≠ b.
… even if the mapping is exact
Contained reformulation
Definition: Q’S is a contained reformulation of QT across M (denoted M ⊨ Q’S ⊑ QT) if, for every pair of instances I, J of S, T s.t. ⟨I,J⟩ ⊨ M: Q’S(I) ⊆ QT(J)
Maximally-contained reformulation
Definition: QSmax is a maximally-contained reformulation of QT across M if: M ⊨ QSmax ⊑ QT and Q’S ⊑ QSmax, for every Q’S s.t. M ⊨ Q’S ⊑ QT
The union of all contained reformulations is a maximally-contained reformulation:
QSmax ≡ reformM(QT) ≡ ⋃ { Q’S : M ⊨ Q’S ⊑ QT }
Maximally-contained reformulations compute certain answers
Proposition ([AD98],[FKMP03],[T05]): Let certainM(Q) = λI. certainM,I(Q)
Then: certainM(Q) ≡ reformM(Q)
(i.e., ∀I: reformM(Q)(I) = certainM,I(Q))
Note that the above holds for any mapping (i.e., not necessarily conjunctive)
Reformulation algorithms (GAV)
Sound/exact GAV mappings: e.g. QS(x,y) ⊑ T(x)
Reformulation: for every relation Ti(x) of the target schema, let ri be the set of rules with Ti in their head (there may be more than one).
Let QTi(x) be the union of the conjunctive queries in the bodies of the rules in ri.
Substitute Ti(x) atoms in Q by QTi(x).
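The substitution step above amounts to view unfolding. The sketch below illustrates it under simplifying assumptions: rule heads contain only distinct variables, and renamed body variables (suffix `_0`, `_1`, …) do not clash with query variables. The function `unfold`, the rule encoding, and the sample rules over S1 and S2 are hypothetical.

```python
from itertools import count, product

def unfold(query_body, gav_rules):
    """Return a union (list) of conjunctive bodies that mention only source relations."""
    fresh = count()
    choices = []
    for rel, args in query_body:
        alternatives = []
        for (head_rel, head_vars), body in gav_rules:
            if head_rel != rel:
                continue
            # Bind head variables to the atom's arguments; rename the others apart.
            subst = dict(zip(head_vars, args))
            tag = next(fresh)
            alternatives.append(
                [(r, tuple(subst.get(v, f"{v}_{tag}") for v in a)) for r, a in body]
            )
        choices.append(alternatives)
    # One conjunctive query per way of picking a rule for every target atom.
    return [[atom for part in combo for atom in part] for combo in product(*choices)]

# Hypothetical GAV rules:  T(x) :- S1(x,y)   and   T(x) :- S2(x)
rules = [(("T", ("x",)), [("S1", ("x", "y"))]),
         (("T", ("x",)), [("S2", ("x",))])]
# Q(u) :- T(u)
reformulation = unfold([("T", ("u",))], rules)
```

For this input the reformulation is the union of Q(u) :- S1(u, y_0) and Q(u) :- S2(u). A target atom with no defining rule makes the union empty.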
Reformulation algorithms (LAV/GLAV)
Sound LAV/GLAV mappings: r: S1(x,y),…,Sn(x,y) ⊑ T1(x,z),…,Tm(x,z)
(note: the Ti’s are not necessarily distinct relational atoms)
(equivalent tgd: ∀x,y S1(x,y),…,Sn(x,y) → ∃z T1(x,z),…,Tm(x,z))
Inverse rules ([DG97]): for every rule r and every i ∈ [1..m] define a rule:
Ti(x, fr,z1(x,y), …, fr,zk(x,y)) :- S1(x,y),…,Sn(x,y)
(tgd: ∀x,y S1(x,y),…,Sn(x,y) → Ti(x, fr,z1(x,y), …, fr,zk(x,y)); skolemization of existential variables)
Inverse rules: Example
r: S1(x,y), S2(y,w) ⊑ T1(x,z), T1(z,w)
Inverse rules:
T1(x, fr,z(x,y,w)) :- S1(x,y), S2(y,w)
T1(fr,z(x,y,w), w) :- S1(x,y), S2(y,w)
Observe that the same Skolem term (fr,z(x,y,w)) represents the common existential variable (z) of the two atoms.
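The construction can be sketched mechanically: every target variable that does not occur in the source side is replaced by a Skolem term over all source variables. This is a small illustration that produces the rules as strings; the function `inverse_rules` and the textual naming `f_r_z` for the Skolem function fr,z are assumptions of the sketch.

```python
def inverse_rules(name, source_atoms, target_atoms):
    """Build one inverse rule (as a string) per target atom of a sound GLAV mapping."""
    # Collect the source-side variables in order of first occurrence.
    body_vars = []
    for _, args in source_atoms:
        for v in args:
            if v not in body_vars:
                body_vars.append(v)
    skolem_args = ",".join(body_vars)

    def term(v):
        # Existential variables become Skolem terms over all source variables.
        return v if v in body_vars else f"f_{name}_{v}({skolem_args})"

    body = ", ".join(rel + "(" + ",".join(args) + ")" for rel, args in source_atoms)
    rules = []
    for rel, args in target_atoms:
        head = rel + "(" + ",".join(term(v) for v in args) + ")"
        rules.append(head + " :- " + body)
    return rules

# r: S1(x,y), S2(y,w) ⊑ T1(x,z), T1(z,w)
example = inverse_rules("r",
                        [("S1", ("x", "y")), ("S2", ("y", "w"))],
                        [("T1", ("x", "z")), ("T1", ("z", "w"))])
```

Both generated rules share the term `f_r_z(x,y,w)`, mirroring the observation on the slide.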
Query reformulation using inverse rules
Create a logic program PQ composed of:
the query Q
the inverse rules of all mappings M
Let P(I) be the result of evaluating a logic program P on a set of facts I.
Theorem ([DG97,AD98]): Let PQ+ be a logic program s.t. for every set of facts I, PQ+(I) is the result of discarding all tuples that contain Skolem terms from PQ(I). Then:
PQ+ is a maximally-contained reformulation
PQ+(I) = certainM,I(Q)
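As a concrete illustration, the sketch below evaluates the two inverse rules of the mapping r: S1(x,y), S2(y,w) ⊑ T1(x,z), T1(z,w) on a tiny source instance, runs a query over the derived T1 facts, and discards answers that contain Skolem terms. The source instance, the query Q(x,w) :- T1(x,z), T1(z,w), and the helper names are assumptions chosen for the example; the rules are hard-coded rather than interpreted by a general logic-program engine.

```python
def skolem(name, *args):
    """Represent a Skolem term as a tagged tuple so it can be told apart from constants."""
    return ("skolem", name, args)

def has_skolem(tup):
    return any(isinstance(v, tuple) and v[:1] == ("skolem",) for v in tup)

def apply_inverse_rules(s1, s2):
    """T1(x, f(x,y,w)) and T1(f(x,y,w), w) for every S1(x,y), S2(y,w)."""
    t1 = set()
    for (x, y) in s1:
        for (y2, w) in s2:
            if y == y2:
                z = skolem("f_r_z", x, y, w)
                t1.add((x, z))
                t1.add((z, w))
    return t1

def answer(t1):
    """Q(x,w) :- T1(x,z), T1(z,w), keeping only tuples without Skolem terms."""
    result = {(x, w) for (x, z) in t1 for (z2, w) in t1 if z == z2}
    return {t for t in result if not has_skolem(t)}

# Assumed source instance: S1 = {(a,b)}, S2 = {(b,c)}
t1_facts = apply_inverse_rules({("a", "b")}, {("b", "c")})
certain = answer(t1_facts)
```

Every derived T1 fact contains the Skolem term, so a query that simply returned T1 would have no certain answers, while the join through the shared term yields (a, c).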
Peer Data Management Systems
[Figure: peers P1, P2, P3, …, Pn with local sources S1, S2, S3, …, Sn (instances I1, I2, I3, …, In), connected by peer mappings M12, M23, M31, Mn3.]
LAV source-to-peer mappings
P2P mappings: inclusion (sound) or equality (exact) GLAV + definitional (GAV)
Queries can be issued at any peer
Every peer can be both source and target w.r.t. different mappings
Pairs of peers may be indirectly connected (by paths of mappings)
Simple PDMS example
[Figure: peers P1 and P2, with peer relations ProjMem, Area, SameProj and Author, and stored relations S1 (instance I1) and S2 (instance I2).]
r1: S1(n,p,a) ⊆ ProjMem(n,p), Area(p,a)
r2: S2(n1,n2) ⊆ Author(n1,p), Author(n2,p)
r0: SameProj(n1,n2,p) = ProjMem(n1,p), ProjMem(n2,p)
Q(n1,n2) :- SameProj(n1,n2,p), Author(n1,p), Author(n2,p)
Mapping Graph
[Figure: mapping graph over the peer relations ProjMem, Area, SameProj, Author and the stored relations S1, S2 (instances I1, I2) at peers P1, P2, with edges labeled r0a, r0b, r1, r2.]
r1: S1(n,p,a) ⊆ ProjMem(n,p), Area(p,a)
r2: S2(n1,n2) ⊆ Author(n1,p), Author(n2,p)
r0a: SameProj(n1,n2,p) ⊇ ProjMem(n1,p), ProjMem(n2,p)
r0b: SameProj(n1,n2,p) ⊆ ProjMem(n1,p), ProjMem(n2,p)
Query answering in PDMS
Theorem ([HIST05]): In general, query answering in PDMS is undecidable
Reason: cycles in the mapping graph
For an acyclic mapping graph, query answering is in PTIME
Still in PTIME for a limited form of cycles (i.e., exact mappings with some restrictions)
Allows chains of sound (“LAV”) mappings and exact (“GAV”) mappings without projections
Piazza reformulation algorithm
Sound and complete for an acyclic mapping graph and the limited form of cycles
Sound, in general (computes a subset of the certain answers)
Piazza reformulation algorithm (1)
q: Q(n1,n2) :- SameProj(n1,n2,p), Author(n1,w), Author(n2,w)
r0: SameProj(n1,n2,p) :- ProjMem(n1,p), ProjMem(n2,p)
r1: S1(n,p,a) ⊆ ProjMem(n,p), Area(p,a)
ir1a: ProjMem(n,p) :- S1(n,p,a)
r2: S2(n1,n2) ⊆ Author(n1,p), Author(n2,p)
ir2a: Author(n1,f(n1,n2)) :- S2(n1,n2)
ir2b: Author(n2,f(n1,n2)) :- S2(n1,n2)
Rule-goal tree (indentation shows expansion):
q
  SameProj(n1,n2,p)
    r0: ProjMem(n1,p), ProjMem(n2,p)
      ir1a: S1(n1,p,_)
      ir1a: S1(n2,p,_)
  Author(n1,w)
    ir2a: S2(n1,n2)
    ir2b: S2(n2,n1)
  Author(n2,w)
    ir2a: S2(n2,n1)
    ir2b: S2(n1,n2)
Piazza reformulation algorithm (2)
[Figure: the same tree as on the previous slide, with root Q(n1,n2).]
Q(n1,n2) :- (S1(n1,p,_) ∧ S1(n2,p,_)) ∧ (S2(n1,n2) ∪ S2(n2,n1)) ∧ (S2(n2,n1) ∪ S2(n1,n2))
≡ (S1(n1,p,_) ∧ S1(n2,p,_) ∧ S2(n1,n2)) ∪ (S1(n1,p,_) ∧ S1(n2,p,_) ∧ S2(n2,n1))
Universal solutions
Data exchange setting S, T, M; instance I of S
An instance J of T is a universal solution of the data exchange setting above if it has homomorphisms to all other solutions
Solutions contain constants (i.e., values that appear in I) and variables (labeled nulls)
Homomorphism h: J1 → J2 between target instances:
h(c) = c, for every constant c
If R(a1,…,am) is in J1, then R(h(a1),…,h(am)) is in J2
Universal solutions
[Figure: source instance I of S and, through mapping M, the solutions J1, J2, J3, … over T; the universal solution J has homomorphisms h1, h2, h3, … into each of them.]
Universal solutions example
M: ∀x,z S(x,z) → ∃y T(x,y) ∧ T(y,z)
I = {S(a,b)}
Solutions:
J1 = {T(a,a), T(a,b)} is not universal
J2 = {T(a,b), T(b,b)} is not universal
J3 = {T(a,X), T(X,b)} is universal
J4 = {T(a,X), T(X,b), T(a,Y), T(Y,b)} is universal
J5 = {T(a,X), T(X,b), T(Y,Y)} is not universal
. . .
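The classification above can be checked on the five listed solutions with a brute-force homomorphism search. The sketch assumes a single binary relation T stored as a set of pairs, and follows the slide's convention that uppercase names (X, Y) are labeled nulls and lowercase names are constants; `homomorphism_exists` and `maps_into_all` are hypothetical helper names. Mapping into these five solutions is only a necessary condition for universality, since the full set of solutions is infinite.

```python
from itertools import product

def is_null(v):
    # Assumed convention: labeled nulls are uppercase, constants lowercase.
    return v[0].isupper()

def homomorphism_exists(j1, j2):
    """Search for h with h(c) = c on constants and h(fact) in j2 for every fact of j1."""
    nulls = sorted({v for fact in j1 for v in fact if is_null(v)})
    values = sorted({v for fact in j2 for v in fact})
    for image in product(values, repeat=len(nulls)):
        h = dict(zip(nulls, image))
        if all(tuple(h.get(v, v) for v in fact) in j2 for fact in j1):
            return True
    return False

solutions = {
    "J1": {("a", "a"), ("a", "b")},
    "J2": {("a", "b"), ("b", "b")},
    "J3": {("a", "X"), ("X", "b")},
    "J4": {("a", "X"), ("X", "b"), ("a", "Y"), ("Y", "b")},
    "J5": {("a", "X"), ("X", "b"), ("Y", "Y")},
}

# For each candidate, test whether it maps homomorphically into every listed solution.
maps_into_all = {
    name: all(homomorphism_exists(j, other) for other in solutions.values())
    for name, j in solutions.items()
}
```

Only J3 and J4 map into all five instances. J5 fails because T(Y,Y) would need a tuple of the form T(v,v) in J3, and J1 and J2 fail because their constants cannot be moved.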
Computing universal solutions
Apply the chase procedure on the joint instance ⟨I,∅⟩
Source-to-target dependencies only: terminates in PTIME and produces a joint instance ⟨I,J⟩, where J is a universal solution (chase(I))
Target dependencies: not guaranteed to terminate
If it does, it computes a universal solution
If it fails, no universal solution exists
Example chase sequence
d1: ∀x,y,z S(x,y) ∧ S(y,z) → ∃w T(x,z,w)
Start: ⟨{S(a,b),S(b,c),S(b,d),S(a,e),S(e,c)}, ∅⟩
h1: x → a, y → b, z → c; extend to h1’: w → X1
⇒ ⟨{S(a,b),S(b,c),S(b,d),S(a,e),S(e,c)}, {T(a,c,X1)}⟩
h2: x → a, y → b, z → d; extend to h2’: w → X2
⇒ ⟨{S(a,b),S(b,c),S(b,d),S(a,e),S(e,c)}, {T(a,c,X1),T(a,d,X2)}⟩
h3: x → a, y → e, z → c; extend to h3’: w → X1
not applicable! (T(a,c,X1) already satisfies d1 for h3)
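The chase sequence above can be replayed with a few lines of code. This is a sketch specialised to the single dependency d1, not a general chase engine; the function name `chase_d1`, the sorted iteration order, and the null names X1, X2, … are assumptions made so that the result is deterministic.

```python
from itertools import count

def chase_d1(source):
    """Chase with d1: S(x,y) ∧ S(y,z) → ∃w T(x,z,w), starting from an empty target."""
    target = []
    fresh = count(1)
    for (x, y) in sorted(source):
        for (y2, z) in sorted(source):
            if y != y2:
                continue
            # A chase step applies only if no fact T(x, z, _) exists yet.
            if any(t[0] == x and t[1] == z for t in target):
                continue
            target.append((x, z, f"X{next(fresh)}"))
    return target

source_instance = {("a", "b"), ("b", "c"), ("b", "d"), ("a", "e"), ("e", "c")}
chased = chase_d1(source_instance)
```

With this instance the match x → a, y → e, z → c is skipped, because T(a,c,X1) was already added for the match through b, so the result contains exactly T(a,c,X1) and T(a,d,X2).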
Universal solutions and query answering
Theorem ([FKMP]): If Q is a conjunctive query, I is a source instance and J is a universal solution: Q(J)+ = certainM,I(Q)
(Q(J)+ denotes Q(J) without the tuples that contain labeled nulls)
Any solution J for which the above holds for every conjunctive query is universal
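A small sketch of this recipe on the running example: evaluate the query on the universal solution J3 = {T(a,X), T(X,b)} and keep only null-free tuples. The two queries, the uppercase-null convention, and the helper names are assumptions for illustration; the comparison with J1 and J2 is only a sanity check against two of the infinitely many solutions.

```python
def is_null(v):
    # Assumed convention: labeled nulls are uppercase, constants lowercase.
    return v[0].isupper()

def drop_nulls(answers):
    """Keep only the answer tuples that consist entirely of constants."""
    return {t for t in answers if not any(is_null(v) for v in t)}

def q_join(t):
    """Q(x,z) :- T(x,y), T(y,z)"""
    return {(x, z) for (x, y) in t for (y2, z) in t if y == y2}

def q_first(t):
    """Q'(x) :- T(x,y)"""
    return {(x,) for (x, _) in t}

J1 = {("a", "a"), ("a", "b")}
J2 = {("a", "b"), ("b", "b")}
J3 = {("a", "X"), ("X", "b")}   # universal solution for I = {S(a,b)}

certain_join = drop_nulls(q_join(J3))
certain_first = drop_nulls(q_first(J3))
```

The join query yields (a, b), which also appears in the answers over J1 and J2, and the projection query yields only (a), since the tuple (X) is discarded.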
Using inverse rules to compute universal solutions
For every relation Ti of T, let PM,Ti be the reformulation of the query Q(x) :- Ti(x), using the inverse rules algorithm.
Proposition: ⋃i PM,Ti(I) corresponds to chase(I)
Crux: every step of a chase sequence corresponds to a step in the evaluation of the logic program using SLD resolution
Corollary: ⋃i PM,Ti(I) is a universal solution
Applying data exchange in GAV/LAV settings
[Figure: sources S1, S2, …, Sn (instances I1, I2, …, In), combined source schema S (instance I), mappings M1, M2, …, Mn, target T with instances J1, J2, …, Jn and J, and query Q.]
Performance tradeoffs
Data exchange:
- Requires the computation of a solution (polynomial in the size of the instance I)
- Needs to propagate updates in the source; may require recomputing the whole universal solution
+ But then query evaluation is easy and efficient
+ If the query load is large, the cost of computing the solution may be amortized
Performance tradeoffs
Reformulation:
+ No “startup” cost
+ No need to propagate updates
- Adds overhead to query processing (although reformulations for “common” queries can be precomputed/cached)
- Requires a distributed query evaluation engine (but there is room for optimization, e.g., adaptive query processing)
- Generated reformulations are generally not minimal
Conclusions
Two approaches for answering queries across mappings:
Reformulation (data integration)
Universal solutions (data exchange)
Different problems
Data exchange is concerned with other aspects, e.g., identifying the appropriate solution to materialize
Same answers (certain answers)
Performance tradeoffs
Tight relationship between chase and inverse rules techniques