Data Integration

Zachary G. Ives
University of Pennsylvania

CIS 650 – Database & Information Systems

October 9, 2008

Administrivia

Next week: Tuesday, Database & Information Retrieval Day, Levine 101

http://dbirday.cis.upenn.edu

Schema Mappings

We generally use queries as the basis of mappings

Goals:
  Compose a query with a set of mappings
  Intersect the constraints in the query and mappings, returning only data matching the constraints
  (Possibly) compose chains of mappings
  (Possibly) invert mappings

The basic formalism: mappings are conjunctive queries
  Q(X) :- R1(X1), R2(X2), …, c1(X1)

And both queries and the overall set of mappings are unions of conjunctive queries

Why: tractability!

The Job of Mappings

Between different data sources:
  May have different numbers of tables (different decompositions)
  Attributes may be broken down differently (“rating” vs. “EbertThumb” and “RoeperThumb”)
  Metadata in one relation may be data in another
  Values may not exactly correspond (“shows” vs. “movies”)
  It may be unclear whether a value is the same (“COPPOLA” vs. “Francis Ford Coppola”)
  May have different, but synonymous terms (ImdbID “123456” vs. SSN “987-45-3210”)
  Might have sub/superclass relationships

General Techniques

Value-value correspondences accomplished using concordance tables
  Join through a table mapping values to values
  Imdb_Actor(ID, SAG_actor_name)

Table-multitable correspondences accomplished using joins (in one direction), projections (in other direction)
  Key question: what happens if a needed attribute is missing? (e.g., DecentMovie has no genre)

Super/subclass relationships generally must be captured using selection (in one direction), union (in other direction)

… And sometimes we just can’t specify the correspondence!

Some Examples of Mappings

Show(ID, Title, Year, Lang, Genre)

Movie(ID, Title, Year, Genre, Director, Star1, Star2)

EnglishMovie(Title, Year, Genre, Rating)

Docu(ID, Title, Year)
Participant(ID, Name, Role)

PieceOfArt(I, T, Y, “English”, G) :- EnglishMovie(T, Y, G, _), MovieIDFor(I, T, Y)

Movie(I, T, Y, “doc”, D, S1, S2) :- Docu(I, T, Y), Participant(I, D, “Dir”), Participant(I, S1, “Cast1”), Participant(I, S2, “Cast2”)

T1:  ImdbID | CastOf
     1234   | Catwoman

T2:  Name      | CastOf
     Berry, H. | Monster’s Ball

Need a concordance table from ImdbIDs to actress names
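
A join through a concordance table can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `films_by_actor` is made up, and the concordance row linking ImdbID 1234 to “Berry, H.” is a hypothetical entry of Imdb_Actor(ID, SAG_actor_name), not data taken from a real source.

```python
# T1(ImdbID, CastOf) and T2(Name, CastOf), as on the slide
t1 = [(1234, "Catwoman")]
t2 = [("Berry, H.", "Monster's Ball")]

# Hypothetical concordance table Imdb_Actor(ID, SAG_actor_name)
imdb_actor = [(1234, "Berry, H.")]


def films_by_actor(t1, t2, concordance):
    """Merge T1 and T2 by translating ImdbIDs to names via the concordance table."""
    id_to_name = dict(concordance)
    merged = {}
    for imdb_id, film in t1:
        name = id_to_name.get(imdb_id)
        if name is not None:  # IDs without a concordance entry cannot be matched
            merged.setdefault(name, set()).add(film)
    for name, film in t2:
        merged.setdefault(name, set()).add(film)
    return merged


result = films_by_actor(t1, t2, imdb_actor)
```

Without the concordance row, the two tuples could not be recognized as describing the same person.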

Query Answering with Mappings: Reformulation

Inputs: a query Q, a set of mappings M, and a set of sources S
  M1(X) :- R1(X1,Y1), R2(X2,Y2), …, c1(X1,Y1), …
  ∀X M2(X) → ∃Y1,Y2 R1(X1,Y1) ∧ R2(X2,Y2) ∧ … ∧ c1(X1,Y1) ∧ …

Goal: a set of rewritings Q’, expressed as a union of conjunctive queries over S, which typically returns the set of all certain answers: those answers implied by the base data and the constraints expressed in the mappings

Kinds of Schema Mappings

Global As View (GAV):
  ∀X M(X) ← ∃Y1,Y2 R1(X1,Y1) ∧ R2(X2,Y2) ∧ … ∧ c1(X1,Y1) ∧ …
  Q(X) :- MR(X1,Y1), MS(X2,Y2), …

Local As View (LAV):
  ∀X MR(X) → ∃Y1,Y2 R1(X1,Y1) ∧ R2(X2,Y2) ∧ … ∧ c1(X1,Y1) ∧ …
  Q(X) :- MR(X1,Y1), MS(X2,Y2), …

Global-Local As View (GLAV), aka Tuple-Generating Dependencies (TGDs):
  ∀X MR(X1,Z1) ∧ MS(X2,Z1) → ∃Y1,Y2 R1(W1,Y1) ∧ R2(W2,Y2) ∧ … ∧ c1(X1,Y1) ∧ …
  where X1 ∪ X2 = W1 ∪ W2 ∪ …

Query Reformulation in Global-As-View

The most traditional scheme, implemented in most commercial systems
  Mediated schema is a view over source data
  Example real-world systems: IBM DB2 / WebSphere Information Integrator; Oracle Fusion

Reuses query unfolding capabilities from a DBMS:
  Query over a View over Base data → Query over Base data

Query Unfolding: Basic Procedure

V1(x,y,z) :- R1(x,y,w), R2(w,u), R3(u,z)
V2(x,y) :- R1(x,u), R2(u,y), R3(y,w), R4(w,z)

Q(u) :- V1(u,v,w), V2(x,y)

Substitute the body of V1 into Q, renaming appropriately; repeat for V2
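
The substitution step can be sketched in Python. This is a minimal illustration, not how a DBMS implements unfolding: it assumes one rule per view, assumes every term is a variable (no constants), and the fresh-name scheme (`w_0`, `u_1`, …) and the function name `unfold` are choices made here for the example.

```python
from itertools import count


def unfold(query_body, views):
    """Replace each view atom in query_body by that view's body.

    views maps a view name to (head_vars, body); atoms are (pred, args) pairs.
    Assumes a single rule per view and that all terms are variables.
    """
    fresh = count()
    result = []
    for pred, args in query_body:
        if pred not in views:
            result.append((pred, args))  # already a base relation
            continue
        head_vars, body = views[pred]
        n = next(fresh)
        # Head variables take the caller's arguments; the view's other
        # (existential) variables are renamed apart with a per-use suffix.
        subst = dict(zip(head_vars, args))
        for bpred, bargs in body:
            new_args = tuple(subst.setdefault(a, f"{a}_{n}") for a in bargs)
            result.append((bpred, new_args))
    return result


views = {
    "V1": (("x", "y", "z"),
           [("R1", ("x", "y", "w")), ("R2", ("w", "u")), ("R3", ("u", "z"))]),
    "V2": (("x", "y"),
           [("R1", ("x", "u")), ("R2", ("u", "y")), ("R3", ("y", "w")), ("R4", ("w", "z"))]),
}
q_body = [("V1", ("u", "v", "w")), ("V2", ("x", "y"))]
unfolded = unfold(q_body, views)
```

For the slide's query, this yields Q(u) :- R1(u,v,w_0), R2(w_0,u_0), R3(u_0,w), R1(x,u_1), R2(u_1,y), R3(y,w_1), R4(w_1,z_1).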

Challenges

If there are multiple rules for a view, unfolding may generate an exponential number of queries

Each query might be non-minimal

Leads to reasoning about query containment and equivalence
  If containment holds both ways between Q1, Q2 then they are equivalent

We’ll see a containment check later…

Global-As-View: Summary

Very easy to implement: doesn’t require any new logic on the part of a regular DBMS engine
  For instance, Starburst QGM rewrites would work

But some drawbacks, primarily that:
  We don’t have a mechanism to describe when a source contains only a subset of the data in the mediated schema
    e.g., “All books from this source are of type textbook”
  The mediated schema often needs to change as we add sources; it is somewhat “brittle” because it’s defined in terms of sources

An Alternate Approach: Local-As-View

When you integrate something, you have some conceptual model of the integrated domain

Define that as a basic frame of reference, everything else as a view over it

“Local as View” using mappings that are conjunctive queries

May have overlapping/incomplete sources
  Define each source as the subset of a query over the mediated schema: the “open world assumption”
  We can use selection or join predicates to specify that a source contains a range of values:
    ComputerBooks(…) ⊆ Books(Title, …, Subj), Subj = “Computers”

The Local-as-View Model

The basic model is the following:
  “Local” sources are views over the mediated schema
  Sources have the data; mediated schema is virtual
  Sources may not have all the data from the domain: “open-world assumption”

The system must use the sources (views) to answer queries over the mediated schema

Answering Queries Using Views

Assumption: conjunctive queries, set semantics

Suppose we have a mediated schema:
  show(ID, title, year, genre), rating(ID, stars, source)

A conjunctive query might be:
  q(t) :- show(i, t, y, g), rating(i, 5, s)

Recall intuitions about this class of queries:
  Adding a conjunct to a query (e.g., y = 1997) removes answers from the result but never adds any
  Any conjunctive query with at least the same constraints & conjuncts will give valid answers
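
The monotonicity intuition can be checked on a toy instance. The sketch below is illustrative only: the rows of `show` and `rating` are made up, and the function names `q` and `q_1997` are chosen here for the example.

```python
# Hypothetical instance of the mediated schema (made-up rows, for illustration only)
show = {(1, "A", 1997, "drama"), (2, "B", 1998, "comedy"), (3, "C", 1997, "drama")}
rating = {(1, 5, "TVGuide"), (2, 5, "Ebert"), (3, 3, "TVGuide")}


def q(show, rating):
    """q(t) :- show(i, t, y, g), rating(i, 5, s)"""
    return {t for (i, t, y, g) in show for (j, stars, s) in rating
            if i == j and stars == 5}


def q_1997(show, rating):
    """Same query with one extra conjunct: y = 1997"""
    return {t for (i, t, y, g) in show for (j, stars, s) in rating
            if i == j and stars == 5 and y == 1997}


all_five_star = q(show, rating)
only_1997 = q_1997(show, rating)
```

On this instance the restricted query returns a subset of the original query's answers, which is what containment promises for every instance.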

Why This Class of Mappings & Queries?

Abiteboul & Duschka showed the data complexity of answering queries using views with OWA:

views \ queries | CQ    | CQ≠   | PQ    | datalog | FO
CQ              | PTIME | co-NP | PTIME | PTIME   | undec
CQ≠             | PTIME | co-NP | PTIME | PTIME   | undec
PQ              | co-NP | co-NP | co-NP | co-NP   | undec
datalog         | co-NP | undec | co-NP | undec   | undec
FO              | undec | undec | undec | undec   | undec

Note that the common “inflationary semantics” version of Datalog must terminate in PTIME, even with recursion

Query Answering

Suppose we have the query:
  q(t) :- show(i, t, y, g), rating(i, 5, s)

and sources:
  5star(i) ⊆ show(i, t, y, g), rating(i, 5, s)
  TVguide(t,y,g,r) ⊆ show(i, t, y, g), rating(i, r, “TVGuide”)
  movieInfo(i,t,y,g) ⊆ show(i, t, y, g)
  critics(i,r,s) ⊆ rating(i, r, s)
  goodMovies(t,y) ⊆ show(i, t, y, “drama”), rating(i, 5, s), y = 1997

We want to compose the query with the source mappings, but they’re in the wrong direction!

Inverse Rules

We can take every mapping and “invert” it, though sometimes we may have insufficient information:

If
  5star(i) ⊆ show(i, t, y, g), rating(i, 5, s)

then we can also infer that:
  show(i, ???, ???, ???) ⊇ 5star(i)

But how to handle the absence of the missing attributes?
  We know that there must be AT LEAST one instance of ??? for each attribute for each show ID
  So we might simply insert a NULL and define that NULL means “unknown” (as opposed to “missing”)…

But NULLs Lose Information

Suppose we take these rules and ask for:
  q(t) :- show(i, t, y, g), rating(i, 5, s)

If we look at the rule:
  goodMovies(t,y) ⊆ show(i, t, y, “drama”), rating(i, 5, s), y = 1997

“By inspection,” q(t) ⊇ goodMovies(t,y)

But if we apply our inversion procedure, we get:
  show(i, t, y, g) ⊇ goodMovies(t,y), i = NULL, g = “drama”, y = 1997
  rating(i, r, s) ⊇ goodMovies(t,y), i = NULL, r = 5, s = NULL

We need “a special NULL” so we can figure out which IDs and ratings match up

The Solution: “Skolem Functions”

Skolem functions:
  Conceptual “perfect” hash functions
  Each function returns a unique, deterministic value for each combination of input values
  Every function returns a non-overlapping set of values (Skolem function F will never return a value that matches any of Skolem function G’s values)

Skolem functions won’t ever be part of the answer set or the computation: they don’t produce real values
  They’re just a way of logically generating “special NULLs”
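
Inverting a LAV view with Skolem terms can be sketched as below. This is an illustration under assumptions made here: atoms are `(predicate, args)` pairs, variables are bare strings and constants are ints or quoted strings, a Skolem term is represented as a `(function_name, head_vars)` tuple, and the naming scheme `f_<view>_<var>` is hypothetical. The `y = 1997` condition on goodMovies is left out for brevity.

```python
def is_var(term):
    """Convention for this sketch: variables are bare strings;
    constants are ints or strings wrapped in double quotes."""
    return isinstance(term, str) and not term.startswith('"')


def invert(view_name, head_vars, body):
    """Produce one inverse rule (body_atom, view_atom) per body atom of the view.

    Variables that do not appear in the view head are replaced by a Skolem
    term over the head variables, so the same "special NULL" is generated
    wherever that variable occurred.
    """
    head = tuple(head_vars)
    rules = []
    for pred, args in body:
        new_args = tuple(
            a if (not is_var(a) or a in head) else (f"f_{view_name}_{a}", head)
            for a in args
        )
        rules.append(((pred, new_args), (view_name, head)))
    return rules


rules = invert(
    "goodMovies", ("t", "y"),
    [("show", ("i", "t", "y", '"drama"')), ("rating", ("i", 5, "s"))],
)
show_rule, rating_rule = rules
```

Both inverse rules carry the same term f_goodMovies_i(t, y) in the ID position, so the show and rating tuples derived from one goodMovies tuple still join, which plain NULLs would not allow.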

Query Answering Using Inverse Rules

Invert all rules using the procedures described

Take the query and the possible rule expansions and execute them in a Datalog interpreter
  In the previous query, we expand with all combinations of expansions of show and of rating: every possible way of combining and cross-correlating info from different sources

Then discard unsatisfiable rewritings via unification, i.e., substituting in constants from the query for variables in the view

Finally, execute the union of all satisfiable rewritings

Pros & Cons of Inverse Rules

Works even with recursive queries, binding patterns, FDs on schemas

Generally, they take view definitions, split them, and re-join them to produce answers
  Not very efficient

No treatment of <, > predicates

Can we do better?

The Bucket Algorithm

Given a query Q with relations and predicates
  Create a bucket for each subgoal in Q
  Iterate over each view (source mapping)
    If source includes bucket’s subgoal:
      Create mapping between Q’s vars and the view’s vars at the same position
      If satisfiable with substitutions, add to bucket
  Do cross-product of buckets, see if result is contained (exptime, but queries are probably relatively small)
  For each result, do a containment check to make sure the rewriting is contained within the query

Let’s Try a Bucket Example

Query
  q(t) :- show(i, t, y, g), rating(i, 5, s)

Sources
  5star(i) ⊆ show(i, t, y, g), rating(i, 5, s)
  TVguide(t,y,g,r) ⊆ show(i, t, y, g), rating(i, r, “TVGuide”)
  movieInfo(i,t,y,g) ⊆ show(i, t, y, g)
  critics(i,r,s) ⊆ rating(i, r, s)
  goodMovies(t,y) ⊆ show(i, t, y, “drama”), rating(i, 5, s), y = 1997
  good98(t,y) ⊆ show(i, t, y, “drama”), rating(i, 5, s), y = 1998

Populating the Buckets

show(i,t,y,g)      | rating(i,5,s)
5star(i)           | 5star(i)
TVguide(t,y,g,r)   | TVguide(t,y,g,r)
movieInfo(i,t,y,g) | critics(i,r,s)
goodMovies(t,y)    | goodMovies(t,y)
good98(t,y)        | good98(t,y)
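
The bucket-filling step can be sketched in Python. This is a simplified illustration, not a full implementation of the bucket algorithm: a view enters a bucket whenever one of its subgoals has the same predicate and no conflicting constant at any position, and the final containment check on each candidate is omitted. The term convention (bare strings are variables, ints and quoted strings are constants) and the function names are assumptions made here; the `y = 1997` and `y = 1998` conditions are folded into the show subgoals.

```python
from itertools import product


def is_var(term):
    """Variables are bare strings; constants are ints or strings in double quotes."""
    return isinstance(term, str) and not term.startswith('"')


def compatible(q_atom, v_atom):
    """Same predicate and arity, and no position where two different constants meet."""
    (qp, qargs), (vp, vargs) = q_atom, v_atom
    if qp != vp or len(qargs) != len(vargs):
        return False
    return all(is_var(a) or is_var(b) or a == b for a, b in zip(qargs, vargs))


def build_buckets(query_body, views):
    """Map each query subgoal to the names of the views that could supply it."""
    buckets = {atom: [] for atom in query_body}
    for atom in query_body:
        for name, (_head, body) in views.items():
            if any(compatible(atom, v_atom) for v_atom in body):
                buckets[atom].append(name)
    return buckets


SHOW = ("show", ("i", "t", "y", "g"))
RATING5 = ("rating", ("i", 5, "s"))
query_body = [SHOW, RATING5]
views = {
    "5star": (("i",), [SHOW, RATING5]),
    "TVguide": (("t", "y", "g", "r"), [SHOW, ("rating", ("i", "r", '"TVGuide"'))]),
    "movieInfo": (("i", "t", "y", "g"), [SHOW]),
    "critics": (("i", "r", "s"), [("rating", ("i", "r", "s"))]),
    "goodMovies": (("t", "y"), [("show", ("i", "t", 1997, '"drama"')), RATING5]),
    "good98": (("t", "y"), [("show", ("i", "t", 1998, '"drama"')), RATING5]),
}

buckets = build_buckets(query_body, views)
# Each candidate rewriting picks one view per bucket
candidates = list(product(*buckets.values()))
```

With five entries in each bucket, the cross-product produces 25 candidate rewritings, each of which would still need a containment check.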

Evaluation

On the board…

Example of Containment Testing

Suppose we have two queries:

q1(t) :- show(i, t, y, g), rating(i, 5, s), y = 1997
q2(t,y) :- show(i, t, y, “drama”), rating(i, 5, s)

Intuitively, ignoring q2’s “drama” constraint and its extra head variable y, q1 must contain the same or fewer answers vs. q2:
  It has all of the same conditions, except one extra conjunction (i.e., it’s more restricted)
  There’s no union or any other way it can add more data

We can say that q2 contains q1 when this holds for any instance of our mediated schema

Checking Containment via Canonical Databases

To test for q1 ⊆ q2:
  Create a “canonical DB” that contains a tuple for each subgoal in q1
  Execute q2 over it
  If q2 returns a tuple that matches the head of q1, then q1 ⊆ q2

(This is an NP-complete problem in the size of the query. Testing for full first-order logic queries is undecidable!!!)

Let’s see this for our example…

Example Canonical DB

q1(t) :- show(i, t, 1997, g), rating(i, 5, s)
q2(t,y) :- show(i, t, y, “drama”), rating(i, 5, s)

show:   (i, t, 1997, g)
rating: (i, 5, s)

Need to get tuple <t> in executing q2 over this database

What if q2 didn’t ask for g = drama?
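
The canonical-database test can be sketched as a small backtracking search. This is an illustrative sketch under assumptions made here: queries are `(head_args, body)` pairs, variables are bare strings, constants are ints or quoted strings, and both heads are given the same arity (q2 is written with head `(t)` so that it can be compared with q1).

```python
def is_var(term):
    """Variables are bare strings; constants are ints or strings in double quotes."""
    return isinstance(term, str) and not term.startswith('"')


def freeze(term):
    """Turn a variable of q1 into a distinct constant of the canonical database."""
    return ("frozen", term) if is_var(term) else term


def contained_in(q1, q2):
    """Return True if q1 is contained in q2 (conjunctive queries, set semantics)."""
    (head1, body1), (head2, body2) = q1, q2
    if len(head1) != len(head2):
        return False
    canonical_db = {(p, tuple(freeze(a) for a in args)) for p, args in body1}
    target = tuple(freeze(a) for a in head1)

    def extend(assignment, atom, fact):
        """Match one subgoal of q2 against one canonical tuple, or return None."""
        (pred, args), (fpred, fargs) = atom, fact
        if pred != fpred or len(args) != len(fargs):
            return None
        new = dict(assignment)
        for a, f in zip(args, fargs):
            if is_var(a):
                if new.setdefault(a, f) != f:
                    return None  # variable already bound to something else
            elif a != f:
                return None  # constant mismatch
        return new

    def search(k, assignment):
        if k == len(body2):
            answer = tuple(assignment.get(a, a) if is_var(a) else a for a in head2)
            return answer == target
        for fact in canonical_db:
            new = extend(assignment, body2[k], fact)
            if new is not None and search(k + 1, new):
                return True
        return False

    return search(0, {})


q1 = (("t",), [("show", ("i", "t", 1997, "g")), ("rating", ("i", 5, "s"))])
q2_drama = (("t",), [("show", ("i", "t", "y", '"drama"')), ("rating", ("i", 5, "s"))])
q2_any = (("t",), [("show", ("i", "t", "y", "g")), ("rating", ("i", 5, "s"))])
```

With the genre constraint, q2 finds no match for the frozen g, so the test fails; once q2 no longer asks for “drama”, it returns the frozen t and q1 ⊆ q2 holds.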

Buckets, Rev. 2: The MiniCon Algorithm

A “much smarter” bucket algorithm:
  In many cases, we don’t need to perform the cross-product of all items in all buckets
  Eliminates the need for the containment check

This, and the Chase & Backchase strategy of Tannen et al., are the two methods most used in virtual data integration today

MiniCon Descriptions (MCDs)

Basically, a modification to the bucket approach
  “head homomorphism”: defines what variables must be equated
  Variable-substituted version of the subgoals
  Mapping of variable names
  Info about what’s covered

Property 1:
  If a variable occurs in the head of a query, then there must be a corresponding variable in the head of the MCD view
  If a variable participates in a join predicate in the query, then it must be in the head of the view

MCD Construction

For each subgoal of the query
  For each subgoal of each view
    Choose the least restrictive head homomorphism to match the subgoal of the query
    If we can find a way of mapping the variables, then add MCD for each possible “maximal” extension of the mapping that satisfies Property 1

MCDs for Our Example

5star(i) ⊆ show(i, t, y, g), rating(i, 5, s)
TVguide(t,y,g,r) ⊆ show(i, t, y, g), rating(i, r, “TVGuide”)
movieInfo(i,t,y,g) ⊆ show(i, t, y, g)
critics(i,r,s) ⊆ rating(i, r, s)
goodMovies(t,y) ⊆ show(i, t, 1997, “drama”), rating(i, 5, s)
good98(t,y) ⊆ show(i, t, 1998, “drama”), rating(i, 5, s)

view               | h.h.               | mapping            | goals sat.
5star(i)           | i→i                | i→i                | 2
TVguide(t,y,g,r)   | t→t, y→y, g→g      | t→t, y→y, g→g, r→r | 1,2
movieInfo(i,t,y,g) | i→i, t→t, y→y, g→g | i→i, t→t, y→y, g→g | 1
critics(i,r,s)     | i→i, r→r, s→s      | i→i, r→r, s→s      | 2
goodMovies(t,y)    | t→t, y→y           | t→t, y→y           | 1,2
good98(t,y)        | t→t, y→y           | t→t, y→y           | 1,2

q(t) :- show(i, t, y, g), rating(i, r, s), r = 5

Combining MCDs

Now look for ways of combining pairwise disjoint subsets of the goals
  Greatly reduces the number of candidates!
  Also proven to be correct without the use of a containment check

Variations need to be made for:
  Constants in general (I sneaked those in)
  “Semi-interval” predicates (x <= c)

Note that full-blown inequality predicates are co-NP-hard in the size of the data, so they don’t work
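
The combination step can be sketched using the “goals sat.” column of the MCD table from the example. The brute-force enumeration of subsets below is only for illustration and is an assumption of this sketch; it is not how MiniCon itself searches, and the function name `combine` is made up.

```python
from itertools import combinations


def combine(mcds, num_goals):
    """Return the sets of MCDs whose covered goals are pairwise disjoint
    and together cover every query subgoal 1..num_goals."""
    all_goals = frozenset(range(1, num_goals + 1))
    results = []
    for k in range(1, len(mcds) + 1):
        for combo in combinations(mcds, k):
            covered = [goals for _name, goals in combo]
            # Sizes summing to the total while the union is complete
            # implies the goal sets are pairwise disjoint.
            if (sum(len(g) for g in covered) == len(all_goals)
                    and frozenset().union(*covered) == all_goals):
                results.append(tuple(name for name, _goals in combo))
    return results


# (view, goals satisfied), taken from the MCD table of the example
mcds = [
    ("5star", {2}),
    ("TVguide", {1, 2}),
    ("movieInfo", {1}),
    ("critics", {2}),
    ("goodMovies", {1, 2}),
    ("good98", {1, 2}),
]
rewritings = combine(mcds, 2)
```

Only five combinations survive here, compared with the 25 candidates that a cross-product of two five-entry buckets would generate.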

MiniCon and LAV Summary

The state-of-the-art for AQUV (answering queries using views) in the relational world of data integration
  It’s been extended to support “conjunctive XQuery” as well

Scales to large numbers of views, which we need in LAV data integration

Chase & Backchase by Tannen et al.
  A procedure that has very close connections to inverse rules
  Slightly more general in some ways, but:
    Produces equivalent rewritings, not maximally contained ones
    Not always polynomial in the size of the data

Recall

Next reading assignment:
  DeWitt and Kabra
  Avnur and Hellerstein
  Compare the different approaches

Start thinking about what you’d like to do for a project
  One-page proposal of your project scope, goals, and means of assessing success/failure due next Monday, Feb. 28th

By now you should have a good idea of what most of the ideas in the handout involve
