Local-as-View Data Integration
Zachary G. Ives
University of Pennsylvania
CIS 650 – Database & Information Systems
February 21, 2005
2
Administrivia
Next reading assignment:
  DeWitt and Kabra
  Avnur and Hellerstein
  Compare the different approaches
Start thinking about what you’d like to do for a project
  One-page proposal of your project scope, goals, and means of assessing success/failure due next Monday, Feb. 28th
  By now you should have a good idea of what most of the ideas in the handout involve
3
Today’s Trivia Question
4
Virtues of TSIMMIS
Early adopter of semistructured data, greatly predating XML
  Can support data from many different kinds of sources
  Obviously, doesn’t fully solve the heterogeneity problem
Presents a mediated schema that is the union of multiple views
  Query answering based on view unfolding
Easily composed in a hierarchy of mediators
5
Limitations of TSIMMIS’ Approach
Some data sources may contain data with certain ranges or properties
  “Books by Aho”, “Students at UPenn”, …
  If we ask a query for students at Columbia, we don’t want to bother querying students at Penn…
  How do we express these?
Mediated schema is basically the union of the various MSL templates: as they change, so may the mediated schema
6
An Alternate Approach: The Information Manifold (Levy et al.)
When you integrate something, you have some conceptual model of the integrated domain
  Define that as a basic frame of reference, everything else as a view over it
  “Local as View” using mappings that are conjunctive queries
May have overlapping/incomplete sources
  Define each source as the subset of a query over the mediated schema: the “open world assumption”
  We can use selection or join predicates to specify that a source contains a range of values:
    ComputerBooks(…) ⊆ Books(Title, …, Subj), Subj = “Computers”
7
The Local-as-View Model
The basic model is the following:
  “Local” sources are views over the mediated schema
  Sources have the data; the mediated schema is virtual
  “Open world” assumption: sources may not have all the data from the domain, so we can’t answer queries with negation
The system must use the sources (views) to answer queries over the mediated schema
8
Answering Queries Using Views
Assumption: conjunctive queries, set semantics
Suppose we have a mediated schema:
  show(ID, title, year, genre), rating(ID, stars, source)
A conjunctive query might be:
  q(t) :- show(i, t, y, g), rating(i, 5, s)
Recall intuitions about this class of queries:
  Adding a conjunct to a query (e.g., y = 1997) removes answers from the result but never adds any
  Any conjunctive query with at least the same constraints & conjuncts will give valid answers
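The two intuitions above can be checked on a toy instance. This is a minimal Python sketch and not part of the original slides: the `evaluate` helper, the convention that strings starting with `?` are variables, and the sample tuples are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical instance of the mediated schema (made-up tuples).
DB = {
    "show": {(1, "Movie A", 1997, "drama"), (2, "Movie B", 1998, "comedy")},
    "rating": {(1, 5, "TVGuide"), (2, 5, "critics")},
}

def is_var(term):
    # Assumed convention: strings starting with "?" are variables.
    return isinstance(term, str) and term.startswith("?")

def evaluate(head, body, db, filters=()):
    """Brute-force evaluation of a conjunctive query under set semantics."""
    answers = set()
    # Try every combination of one stored tuple per subgoal.
    for combo in product(*(db[rel] for rel, _ in body)):
        binding, ok = {}, True
        for (_, args), tup in zip(body, combo):
            for a, v in zip(args, tup):
                if is_var(a):
                    # A variable must take the same value everywhere it occurs.
                    ok = ok and binding.setdefault(a, v) == v
                else:
                    # A constant in the query must match the stored value.
                    ok = ok and a == v
        if ok and all(f(binding) for f in filters):
            answers.add(tuple(binding[h] for h in head))
    return answers

# q(t) :- show(i, t, y, g), rating(i, 5, s)
BODY = [("show", ("?i", "?t", "?y", "?g")), ("rating", ("?i", 5, "?s"))]
all_five_star = evaluate(("?t",), BODY, DB)
# Same query with the extra conjunct y = 1997.
only_1997 = evaluate(("?t",), BODY, DB, [lambda b: b["?y"] == 1997])
print(sorted(all_five_star))
print(sorted(only_1997))
```

The restricted query can only return a subset of the unrestricted one, which is the monotonicity that the rewriting algorithms later in the lecture rely on.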
9
Why This Class of Mappings & Queries?
Abiteboul & Duschka showed the data complexity of answering queries using views with OWA:
views \ queries   CQ      CQ!=    PQ      datalog   FO
CQ                PTIME   co-NP   PTIME   PTIME     undec
CQ!=              PTIME   co-NP   PTIME   PTIME     undec
PQ                co-NP   co-NP   co-NP   co-NP     undec
datalog           co-NP   undec   co-NP   undec     undec
FO                undec   undec   undec   undec     undec
Note that the common “inflationary semantics” version of Datalog must terminate in PTIME, even with recursion
10
Query Answering
Suppose we have the query:
  q(t) :- show(i, t, y, g), rating(i, 5, s)
and sources:
  5star(i) ⊆ show(i, t, y, g), rating(i, 5, s)
  TVguide(t,y,g,r) ⊆ show(i, t, y, g), rating(i, r, “TVGuide”)
  movieInfo(i,t,y,g) ⊆ show(i, t, y, g)
  critics(i,r,s) ⊆ rating(i, r, s)
  goodMovies(t,y) ⊆ show(i, t, y, “drama”), rating(i, 5, s), y = 1997
We want to compose the query with the source mappings – but they’re in the wrong direction!
11
Inverse Rules
We can take every mapping and “invert” it, though sometimes we may have insufficient information:
  If
    5star(i) ⊆ show(i, t, y, g), rating(i, 5, s)
  then we can also infer that:
    show(i, ???, ???, ???) ⊇ 5star(i)
But how to handle the absence of the missing attributes?
  We know that there must be AT LEAST one instance of ??? for each attribute for each show ID
  So we might simply insert a NULL and define that NULL means “unknown” (as opposed to “missing”)…
12
But NULLs Lose Information
Suppose we take these rules and ask for:
  q(t) :- show(i, t, y, g), rating(i, 5, s)
If we look at the rule:
  goodMovies(t,y) ⊆ show(i, t, y, “drama”), rating(i, 5, s), y = 1997
“By inspection,” q(t) ⊇ goodMovies(t,y)
But if we apply our inversion procedure, we get:
  show(i, t, y, g) ⊇ goodMovies(t,y), i = NULL, g = “drama”, y = 1997
  rating(i, r, s) ⊇ goodMovies(t,y), i = NULL, r = 5, s = NULL
We need “a special NULL” so we can figure out which IDs and ratings match up
13
The Solution: “Skolem Functions”
Skolem functions:
  Conceptual “perfect” hash functions
  Each function returns a unique, deterministic value for each combination of input values
  Every function returns a non-overlapping set of values (Skolem function F will never return a value that matches any of Skolem function G’s values)
Skolem values won’t ever be part of the answer set: they aren’t real values
  They’re just a way of logically generating “special NULLs”
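To make the “special NULLs” concrete, here is a small Python sketch of inverting one source description. It is not from the slides: the `invert` helper, the `?` prefix for variables, and the naming scheme `f_<view>_<var>` for Skolem functions are assumptions made for illustration, and the comparison y = 1997 is left out to keep the example short.

```python
def is_var(term):
    # Assumed convention: strings starting with "?" are variables.
    return isinstance(term, str) and term.startswith("?")

def invert(view_name, head_vars, body):
    """Produce one inverse rule (head, body) per subgoal of a LAV view definition.

    Variables that do not appear in the view head are replaced by Skolem terms,
    written here as (function_name, argument_tuple) over the head variables.
    """
    rules = []
    for rel, args in body:
        new_args = tuple(
            (f"f_{view_name}_{a[1:]}", tuple(head_vars))
            if is_var(a) and a not in head_vars else a
            for a in args
        )
        rules.append(((rel, new_args), (view_name, tuple(head_vars))))
    return rules

# goodMovies(t,y) ⊆ show(i, t, y, "drama"), rating(i, 5, s)
rules = invert(
    "goodMovies", ("?t", "?y"),
    [("show", ("?i", "?t", "?y", "drama")), ("rating", ("?i", 5, "?s"))],
)
for head, body in rules:
    print(head, ":-", body)
```

Both inverse rules carry the same Skolem term for the show ID, so the derived show and rating tuples can still be joined. A plain NULL in that position would lose exactly this connection.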
14
Query Answering Using Inverse Rules
Invert all rules using the procedures described
Take the query and the possible rule expansions and execute them in a Datalog interpreter
  In the previous query, we expand with all combinations of expansions of show and of rating: every possible way of combining and cross-correlating info from different sources
  Then discard unsatisfiable rewritings via unification, i.e., substituting in constants from the query for variables in the view
  Finally, execute the union of all satisfiable rewritings
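One way to picture the effect is the Python sketch below. It does not expand and unify rewritings as described above. Instead it swaps in a simpler bottom-up evaluation: apply the inverse rules to the source data, then run the query over the derived facts. The source contents, the `Skolem` class, and the `sk` helper are hypothetical and exist only for this illustration.

```python
class Skolem(tuple):
    """A Skolem term such as f_i(t, y): a 'special NULL' that is never a real answer."""

def sk(name, *args):
    return Skolem((name,) + args)

# Hypothetical source contents (made-up tuples).
good_movies = {("Movie A", 1997)}
movie_info = {(7, "Movie B", 1998, "comedy")}
critics = {(7, 5, "critic1")}

# Apply the inverse rules bottom-up to derive facts over the mediated schema.
show, rating = set(), set()
for t, y in good_movies:
    i = sk("f_goodMovies_i", t, y)      # same Skolem ID in both derived facts
    show.add((i, t, y, "drama"))
    rating.add((i, 5, sk("f_goodMovies_s", t, y)))
for i, t, y, g in movie_info:
    show.add((i, t, y, g))
for i, r, s in critics:
    rating.add((i, r, s))

# q(t) :- show(i, t, y, g), rating(i, 5, s)
answers = {
    t
    for (i, t, y, g) in show
    for (i2, r, s) in rating
    if i == i2 and r == 5 and not isinstance(t, Skolem)  # drop Skolem-valued answers
}
print(sorted(answers))
```

“Movie A” is found because the two facts derived from goodMovies share a Skolem ID, and “Movie B” is found by joining two different sources on a real ID.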
15
Pros & Cons of Inverse Rules
Works even with recursive queries, binding patterns, FDs on schemas
Generally, they take view definitions, split them, and re-join them to produce answers
  Not very efficient
No treatment of <, > predicates
Can we do better?
16
The Bucket Algorithm
Given a query Q with relations and predicates
  Create a bucket for each subgoal in Q
  Iterate over each view (source mapping)
    If source includes bucket’s subgoal:
      Create mapping between Q’s vars and the view’s vars at the same position
      If satisfiable with substitutions, add to bucket
  Do cross-product of buckets to form candidate rewritings (exptime, but queries are probably relatively small)
  For each result, do a containment check to make sure the rewriting is contained within the query
17
Let’s Try a Bucket Example
Query
  q(t) :- show(i, t, y, g), rating(i, 5, s)
Sources
  5star(i) ⊆ show(i, t, y, g), rating(i, 5, s)
  TVguide(t,y,g,r) ⊆ show(i, t, y, g), rating(i, r, “TVGuide”)
  movieInfo(i,t,y,g) ⊆ show(i, t, y, g)
  critics(i,r,s) ⊆ rating(i, r, s)
  goodMovies(t,y) ⊆ show(i, t, y, “drama”), rating(i, 5, s), y = 1997
  good98(t,y) ⊆ show(i, t, y, “drama”), rating(i, 5, s), y = 1998
18
Populating the Buckets
show(i,t,y,g)         rating(i,5,s)
5star(i)              5star(i)
TVguide(t,y,g,r)      TVguide(t,y,g,r)
movieInfo(i,t,y,g)    critics(i,r,s)
goodMovies(t,y)       goodMovies(t,y)
good98(t,y)           good98(t,y)
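The bucket contents above can be reproduced with a short Python sketch. It is a simplified reading of the bucket-filling step, not the full algorithm: a view enters a bucket when it has a subgoal over the same relation whose constants do not clash with the query subgoal. The helper names, the `?` prefix for variables, and the encoding of y = 1997 and y = 1998 as constants inside the subgoal are assumptions made for illustration.

```python
from itertools import product

def is_var(term):
    # Assumed convention: strings starting with "?" are variables.
    return isinstance(term, str) and term.startswith("?")

def unifiable(q_args, v_args):
    """Position-wise match: a variable matches anything, two constants must be equal."""
    return all(is_var(a) or is_var(b) or a == b for a, b in zip(q_args, v_args))

def fill_buckets(query_body, views):
    buckets = {subgoal: [] for subgoal in query_body}
    for subgoal in query_body:
        rel, args = subgoal
        for name, body in views.items():
            if any(r == rel and unifiable(args, a) for r, a in body):
                buckets[subgoal].append(name)
    return buckets

SHOW = ("show", ("?i", "?t", "?y", "?g"))
VIEWS = {
    "5star": [SHOW, ("rating", ("?i", 5, "?s"))],
    "TVguide": [SHOW, ("rating", ("?i", "?r", "TVGuide"))],
    "movieInfo": [SHOW],
    "critics": [("rating", ("?i", "?r", "?s"))],
    "goodMovies": [("show", ("?i", "?t", 1997, "drama")), ("rating", ("?i", 5, "?s"))],
    "good98": [("show", ("?i", "?t", 1998, "drama")), ("rating", ("?i", 5, "?s"))],
}
# q(t) :- show(i, t, y, g), rating(i, 5, s)
QUERY = [SHOW, ("rating", ("?i", 5, "?s"))]

buckets = fill_buckets(QUERY, VIEWS)
candidates = list(product(*buckets.values()))
for subgoal, names in buckets.items():
    print(subgoal[0], names)
print(len(candidates), "candidate combinations to check for containment")
```

Each bucket holds five views, so the cross-product yields 25 candidates, and every one of them still needs a containment check.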
19
Evaluation
On the board…
20
Example of Containment Testing
Suppose we have two queries:
q1(t) :- show(i, t, y, g), rating(i, 5, s), y = 1997
q2(t,y) :- show(i, t, y, “drama”), rating(i, 5, s)
Intuitively, setting aside q2’s genre = “drama” constant, q1 must return the same or fewer answers than q2:
  It has all of the same conditions, plus one extra conjunct (i.e., it’s more restricted)
  There’s no union or any other way it can add more data
In that case we can say that q2 contains q1, because this holds for any instance of our mediated schema
21
Checking Containment via Canonical Databases
To test for q1 ⊆ q2:
  Create a “canonical DB” that contains a tuple for each subgoal in q1
  Execute q2 over it
  If q2 returns a tuple that matches the head of q1, then q1 ⊆ q2
(Containment of conjunctive queries is NP-complete in the size of the query. Testing containment for full first-order logic queries is undecidable!)
Let’s see this for our example…
22
Example Canonical DB
q1(t) :- show(i, t, 1997, g), rating(i, 5, s)
q2(t,y) :- show(i, t, y, “drama”), rating(i, 5, s)

show                    rating
i   t   1997   g        i   5   s
Need to get tuple <t> in executing q2 over this database
What if q2 didn’t ask for g = drama?
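The canonical-database test can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the lecture’s code: variables are strings starting with `?`, y = 1997 is written as a constant inside the subgoal (as on this slide), both queries are given the head (t) so their answers are comparable, and the helper names are made up.

```python
from itertools import product

def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def freeze(term):
    # Turn a variable into a fresh constant; real constants stay as they are.
    return ("frozen", term) if is_var(term) else term

def canonical_db(body):
    db = {}
    for rel, args in body:
        db.setdefault(rel, set()).add(tuple(freeze(a) for a in args))
    return db

def evaluate(head, body, db):
    answers = set()
    for combo in product(*(db.get(rel, ()) for rel, _ in body)):
        binding, ok = {}, True
        for (_, args), tup in zip(body, combo):
            for a, v in zip(args, tup):
                if is_var(a):
                    ok = ok and binding.setdefault(a, v) == v
                else:
                    ok = ok and a == v
        if ok:
            answers.add(tuple(binding[h] if is_var(h) else h for h in head))
    return answers

def contained_in(q1, q2):
    """q1 ⊆ q2 for conjunctive queries without comparison predicates."""
    (head1, body1), (head2, body2) = q1, q2
    frozen_head = tuple(freeze(h) for h in head1)
    return frozen_head in evaluate(head2, body2, canonical_db(body1))

RATING = ("rating", ("?i", 5, "?s"))
q1 = (("?t",), [("show", ("?i", "?t", 1997, "?g")), RATING])
q2 = (("?t",), [("show", ("?i", "?t", "?y", "drama")), RATING])
q2_no_genre = (("?t",), [("show", ("?i", "?t", "?y", "?g")), RATING])

print(contained_in(q1, q2))           # the frozen genre g is not "drama"
print(contained_in(q1, q2_no_genre))  # without the genre constant the test succeeds
```

With the “drama” constant in q2 the test fails, because the canonical DB holds the frozen value for g. Dropping the constant makes q2 return the frozen head of q1, so containment holds.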
23
Buckets, Rev. 2: The MiniCon Algorithm
A “much smarter” bucket algorithm:
  In many cases, we don’t need to perform the cross-product of all items in all buckets
  Eliminates the need for the containment check
This, and the Chase & Backchase strategy of Tannen et al., are the two methods most used in virtual data integration today
24
MiniCon Descriptions (MCDs)
Basically, a modification to the bucket approach
  “Head homomorphism”: defines what variables must be equated
  Variable-substituted version of the subgoals
  Mapping of variable names
  Info about what’s covered
Property 1:
  If a variable occurs in the head of a query, then there must be a corresponding variable in the head of the MCD view
  If a variable participates in a join predicate in the query, then it must be in the head of the view, unless the MCD itself covers every query subgoal that uses that variable
25
MCD Construction
For each subgoal of the query
  For each subgoal of each view
    Choose the least restrictive head homomorphism to match the subgoal of the query
    If we can find a way of mapping the variables, then add MCD for each possible “maximal” extension of the mapping that satisfies Property 1
26
MCDs for Our Example
5star(i) ⊆ show(i, t, y, g), rating(i, 5, s)
TVguide(t,y,g,r) ⊆ show(i, t, y, g), rating(i, r, “TVGuide”)
movieInfo(i,t,y,g) ⊆ show(i, t, y, g)
critics(i,r,s) ⊆ rating(i, r, s)
goodMovies(t,y) ⊆ show(i, t, 1997, “drama”), rating(i, 5, s)
good98(t,y) ⊆ show(i, t, 1998, “drama”), rating(i, 5, s)

view                h.h.                  mapping               goals sat.
5star(i)            i→i                   i→i                   2
TVguide(t,y,g,r)    t→t, y→y, g→g         t→t, y→y, g→g, r→r    1,2
movieInfo(i,t,y,g)  i→i, t→t, y→y, g→g    i→i, t→t, y→y, g→g    1
critics(i,r,s)      i→i, r→r, s→s         i→i, r→r, s→s         2
goodMovies(t,y)     t→t, y→y              t→t, y→y              1,2
good98(t,y)         t→t, y→y              t→t, y→y              1,2

q(t) :- show(i, t, y, g), rating(i, r, s), r = 5
27
Combining MCDs
Now look for ways of combining pairwise disjoint subsets of the goals
  Greatly reduces the number of candidates!
  Also proven to be correct without the use of a containment check
Variations need to be made for:
  Constants in general (I sneaked those in)
  “Semi-interval” predicates (x <= c)
Note that full-blown inequality predicates are co-NP-hard in the size of the data, so they don’t work
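The combination step can be illustrated with the “goals sat.” column of the example MCD table. The Python sketch below is an assumption-laden illustration, not the published algorithm: each MCD is reduced to a name plus the set of query subgoals it covers, and the `combine` helper is hypothetical. It enumerates the sets of MCDs whose covered goals are pairwise disjoint and together cover the whole query.

```python
def combine(mcds, goals):
    """Enumerate combinations of MCDs whose covered goals partition `goals`."""
    results = []

    def extend(chosen, covered, start):
        if covered == goals:
            results.append(tuple(chosen))
            return
        for k in range(start, len(mcds)):
            name, sat = mcds[k]
            if not (sat & covered):  # keep the combination pairwise disjoint
                extend(chosen + [name], covered | sat, k + 1)

    extend([], frozenset(), 0)
    return results

# (view, goals satisfied), taken from the example table; goal 1 = show, goal 2 = rating.
MCDS = [
    ("5star", frozenset({2})),
    ("TVguide", frozenset({1, 2})),
    ("movieInfo", frozenset({1})),
    ("critics", frozenset({2})),
    ("goodMovies", frozenset({1, 2})),
    ("good98", frozenset({1, 2})),
]

rewritings = combine(MCDS, frozenset({1, 2}))
for r in rewritings:
    print(" + ".join(r))
```

Only five combinations survive, compared with the 25 pairs that a two-bucket cross-product over the same views would have to generate and test for containment.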
28
MiniCon and LAV Summary
The state-of-the-art for AQUV (answering queries using views) in the relational world of data integration
  It’s been extended to support “conjunctive XQuery” as well
Scales to large numbers of views, which we need in LAV data integration
A similar approach: Chase & Backchase by Tannen et al.
  Slightly more general in some ways, but:
    Produces equivalent rewritings, not maximally contained ones
    Not always polynomial in the size of the data
29
Recall
Next reading assignment:
  DeWitt and Kabra
  Avnur and Hellerstein
  Compare the different approaches
Start thinking about what you’d like to do for a project
  One-page proposal of your project scope, goals, and means of assessing success/failure due next Monday, Feb. 28th
  By now you should have a good idea of what most of the ideas in the handout involve