Virtual Data Integration Helena Galhardas DEI IST (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Virtual Data Integration

Helena Galhardas, DEI IST

(based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

Agenda

- Terminology
- Conjunctive queries and Datalog
- Virtual Data Integration Architecture

(Known) Terminology (1)

- A relational database is composed of a set of relations (or tables)
- A database schema includes a relational schema for each table plus a set of integrity constraints
  - Key constraints, functional dependencies, foreign key constraints
- The rows of a relational table are called tuples (or records)
- The NULL value means the value is not known or doesn’t exist
  - An equality test involving NULL never evaluates to TRUE (in SQL it yields UNKNOWN, which a WHERE clause treats like FALSE)
- A database instance is a particular snapshot of the contents of the DB

(Known) Terminology (2)

- Queries are used to formulate users’ needs
  - Structured (SQL or XQuery) or unstructured (for the Web, e.g., a list of keywords)
- Q(D): result of applying query Q to the database D
- Views are created when we want to reuse the same query expression in other queries
  - Materialized view: the answer is computed and maintained as the DB changes
- Queries can also be used to specify relationships between schemas of data sources in a data integration context
- Two different notations for queries:
  - SQL
  - Conjunctive queries, for formal purposes

Conjunctive queries

- Based on mathematical logic
- A conjunctive query has the form:
    Q(X) :- R1(X1), ..., Rn(Xn), c1, ..., cm
  - R1(X1), ..., Rn(Xn) are the subgoals (or conjuncts) and form the body of the query
  - The Ri’s are database relations; the Xi’s are tuples of variables and constants
  - The variables in X are called head variables or distinguished variables; the others are called existential variables
  - The predicate Q denotes the answer relation of the query
  - The ci’s are interpreted atoms and have the form X θ Y, with X, Y either variables or constants (but at least one is a variable), and θ an interpreted predicate such as =, <=, <, !=, >, >=

Semantics of conjunctive queries

Semantics of a conjunctive query Q over DB instance D:
- Ψ: any mapping that maps each of the variables in Q to constants in D
- Ψ(Ri): the result of applying Ψ to Ri(Xi), which consists of ground atoms (constants)
- Ψ(Q): the result of applying Ψ to Q(X), which consists of ground atoms
- If
  - each of Ψ(R1), ..., Ψ(Rn) is in D, and
  - for each 1 <= j <= m, Ψ(cj) is satisfied,
  then Ψ(Q) is in the answer to Q over D
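
The definition above can be read almost literally as a (very inefficient) evaluation procedure: try every mapping Ψ from the query variables to constants of D and keep Ψ(Q) whenever all ground subgoals are in D and all interpreted atoms hold. The following Python sketch does exactly that. The function name `evaluate_cq`, the tuple-based encoding of queries, and the convention that variables are strings starting with an uppercase letter are assumptions made for this illustration only; the `edge` data is made up.

```python
import operator
from itertools import product

# Interpreted predicates (the theta in "X theta Y").
OPS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt,
       "<=": operator.le, ">": operator.gt, ">=": operator.ge}

def is_var(term):
    # Assumed convention: variables are strings starting with an uppercase letter.
    return isinstance(term, str) and term[:1].isupper()

def evaluate_cq(head, subgoals, comparisons, db):
    """Brute-force semantics: enumerate every mapping psi of variables to constants."""
    variables = sorted({t for _, terms in subgoals for t in terms if is_var(t)})
    constants = {c for rows in db.values() for row in rows for c in row}
    answers = set()
    for values in product(constants, repeat=len(variables)):
        psi = dict(zip(variables, values))

        def ground(term):
            # Apply psi to a term; constants are left unchanged.
            return psi[term] if is_var(term) else term

        atoms_hold = all(tuple(ground(t) for t in terms) in db[rel]
                         for rel, terms in subgoals)
        if atoms_hold and all(OPS[op](ground(a), ground(b))
                              for a, op, b in comparisons):
            answers.add(tuple(ground(t) for t in head))
    return answers

# Q(X,Z) :- edge(X,Y), edge(Y,Z), X < Z   over a made-up three-node cycle
db = {"edge": {(1, 2), (2, 3), (3, 1)}}
result = evaluate_cq(("X", "Z"),
                     [("edge", ("X", "Y")), ("edge", ("Y", "Z"))],
                     [("X", "<", "Z")],
                     db)
print(result)
```

The enumeration is exponential in the number of variables, so this is only meant to mirror the definition; a real engine would evaluate the query with joins.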

Correspondence between SQL queries and conjunctive queries

- Base relations:
    Interview(candidate, name, recruiter, hireDecision, grade)
    EmployeePerformance(empID, name, reviewQuarter, grade, reviewer)
- SQL:
    SELECT recruiter, candidate
    FROM Interview, EmployeePerformance
    WHERE recruiter = EmployeePerformance.name AND EmployeePerformance.grade <= 2.5
- Conjunctive query:
    Q1(Y,X) :- Interview(X,D,Y,H,F), EmployeePerformance(E,Y,T,W,Z), W <= 2.5

Safety, disjunction

- Conjunctive queries must be safe:
  - Every variable appearing in the head also appears in the body
  - Otherwise, the set of possible answers to the query may be infinite
- Disjunction can be expressed by writing two or more conjunctive queries with the same head predicate:
    Q1(Y,X) :- Interview(X,D,Y,H,F), EmployeePerformance(E,Y,T,W,Z), W <= 2.5
    Q1(Y,X) :- Interview(X,D,Y,H,F), EmployeePerformance(E,Y,T,W,Z), W >= 3.9
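
The safety condition is a purely syntactic check on a rule, which a short Python sketch can make concrete. The function name `is_safe`, the tuple encoding of subgoals, and the uppercase-variable convention are assumptions of this illustration, not part of the slides.

```python
def is_var(term):
    # Assumed convention: variables are strings starting with an uppercase letter.
    return isinstance(term, str) and term[:1].isupper()

def is_safe(head, subgoals):
    """A query is safe if every head variable also appears in some body subgoal."""
    body_vars = {t for _, terms in subgoals for t in terms if is_var(t)}
    return all(t in body_vars for t in head if is_var(t))

# Body shared by both Q1 rules on this slide (interpreted atoms bind no variables).
q1_body = [("Interview", ("X", "D", "Y", "H", "F")),
           ("EmployeePerformance", ("E", "Y", "T", "W", "Z"))]

safe_q1 = is_safe(("Y", "X"), q1_body)                       # head variables Y, X occur in the body
unsafe = is_safe(("X", "V"), [("Interview", ("X", "D", "Y", "H", "F"))])  # V never occurs
print(safe_q1, unsafe)
```

In the unsafe rule, V could be bound to any value at all, which is why the answer set may be infinite.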

Negation

- Conjunctive queries with negated subgoals can also be considered:
    Q(X) :- R1(X1), ..., Rn(Xn), ¬S1(Y1), ..., ¬Sm(Ym)
- The notion of safety is extended to:
  - Any variable appearing in the head of the query must also appear in a positive subgoal
- To produce an answer for the query, the mappings from the variables of Q to the constants in the database must satisfy:
    Ψ(S1(Y1)), ..., Ψ(Sm(Ym)) ∉ D

Datalog program

- Set of rules, each of which is a conjunctive query
- Instead of computing a single answer relation, it computes a set of intensional relations (IDB relations), one of which is designated the query predicate
- Extensional Database (EDB) relations: the database relations, given by a set of tuples; they can occur only in the body of rules
- Intensional Database (IDB) relations: defined by a set of rules; they can occur both in the head and in the body of rules

Datalog Terminology

- An example datalog rule:
    idb(x,y) :- r1(x,z), r2(z,y), z < 10
  Here idb(x,y) is the head; r1(x,z), r2(z,y), and z < 10 are the subgoals, which together form the body
- Irrelevant variables can be replaced by _ (anonymous variable)
- Extensional relations or database schemas (edbs) are relations only occurring in rules’ bodies, or as base relations:
  - Ground facts only have constants, e.g., r1('abc', 123)
- Intensional relations (idbs) appear in the heads; these are basically views
- Distinguished variables are the ones output in the head

Example

- A database that includes a relation representing the edges in a graph: edge(X,Y)
- The Datalog query:
    r1: path(X,Y) :- edge(X,Y)
    r2: path(X,Y) :- edge(X,Z), path(Z,Y)
  computes the paths in the graph
- If r2 were replaced by:
    r3: path(X,Y) :- path(X,Z), path(Z,Y)
  it would produce the same result

Semantics of a Datalog program

- Start with empty extensions for the IDB predicates
- Choose a rule in the program and apply it to the current extension of the EDB and IDB relations
  - Add the tuples computed for the head to its extension
- Continue applying the rules of the program until no new tuples are computed for the IDB relations
- The answer to the query is the extension of the query predicate

Example (cont)

    r1: path(X,Y) :- edge(X,Y)
    r2: path(X,Y) :- edge(X,Z), path(Z,Y)

- Start with: edge(1,2), edge(2,3), edge(3,4)
- Applying r1 returns: path(1,2), path(2,3), path(3,4)
- Applying r2 once adds: path(1,3), path(2,4)
- Applying r2 a second time adds: path(1,4)
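
The rule-application procedure traced above can be sketched as a naive fixpoint loop in Python. This is a minimal illustration hard-coded for rules r1 and r2, not a general Datalog engine; the function name `transitive_paths` is an assumption of this sketch.

```python
def transitive_paths(edges):
    """Naive fixpoint evaluation of:
         r1: path(X,Y) :- edge(X,Y)
         r2: path(X,Y) :- edge(X,Z), path(Z,Y)
    """
    path = set()  # the IDB predicate starts with an empty extension
    while True:
        derived = set(edges)                                   # apply r1
        derived |= {(x, y) for (x, z) in edges
                    for (z2, y) in path if z == z2}            # apply r2
        if derived <= path:        # no new tuples: fixpoint reached
            return path
        path |= derived

edges = {(1, 2), (2, 3), (3, 4)}
paths = transitive_paths(edges)
print(sorted(paths))
```

Each pass re-derives everything from scratch, which is wasteful; semi-naive evaluation, which only joins against tuples that are new since the previous pass, is the usual refinement.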

Datalog is Relationally Complete

We can map RA → Datalog:
- Selection σ_p: p becomes a datalog subgoal
- Projection Π_A: we drop projected-out variables from the head
- Cross-product r × s: q(A,B,C,D) :- r(A,B), s(C,D)
- Join r ⋈ s: q(A,B,C,D) :- r(A,B), s(C,D), condition
- Union r ∪ s: q(A,B) :- r(A,B) ; q(C,D) :- s(C,D)
- Difference r – s: q(A,B) :- r(A,B), ¬ s(A,B)
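
To make the mapping concrete, each operator can be written as a Python set comprehension that mirrors the corresponding rule body. The relations `r` and `s`, the selection predicate A > 1, and the join condition B = C are made-up examples chosen for this sketch.

```python
# Two small binary relations (made-up data).
r = {(1, 2), (3, 4)}
s = {(3, 4), (4, 6)}

# Selection:     q(A,B) :- r(A,B), A > 1
selection = {(a, b) for (a, b) in r if a > 1}
# Projection:    q(A) :- r(A,_)
projection = {(a,) for (a, _) in r}
# Cross-product: q(A,B,C,D) :- r(A,B), s(C,D)
cross = {(a, b, c, d) for (a, b) in r for (c, d) in s}
# Join:          q(A,B,C,D) :- r(A,B), s(C,D), B = C
join = {(a, b, c, d) for (a, b) in r for (c, d) in s if b == c}
# Union:         q(A,B) :- r(A,B) ; q(C,D) :- s(C,D)
union = {(a, b) for (a, b) in r} | {(c, d) for (c, d) in s}
# Difference:    q(A,B) :- r(A,B), not s(A,B)
difference = {(a, b) for (a, b) in r if (a, b) not in s}

print(selection, projection, join, difference)
```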

Great… But then why do we care about Datalog?

A Query We Can’t Answer in RA

Recall our example of a binary relation for graphs or trees (similar to an XML Edge relation):

edge(from, to)

If we want to know what nodes are reachable:

    reachable(F, T, 1) :- edge(F, T)                        dist. 1
    reachable(F, T, 2) :- edge(F, X), edge(X, T)            dist. 2
    reachable(F, T, 3) :- reachable(F, X, 2), edge(X, T)    dist. 3

But how about all reachable paths? (Note this was easy in XPath over an XML representation -- //edge)

Recursive Datalog Queries

Define a recursive query in datalog:

    reachable(F, T, 1) :- edge(F, T)                             distance 1
    reachable(F, T, D + 1) :- reachable(F, X, D), edge(X, T)     distance >1

- What does this mean, exactly, in terms of logic?
  - There are actually three different (equivalent) definitions of semantics; we will use that of fixpoint:
    - We start with an instance of data, then derive all immediate consequences
    - We repeat as long as we derive new facts
- In the RA, this requires a (restricted) while loop!
  - “Inflationary semantics” (which, for rules that invent no new values, terminates in time polynomial in the size of the database!)

Our Query in RA + while (inflationary semantics, no negation)

Datalog:

    reachable(F, T, 1) :- edge(F, T)
    reachable(F, T, D+1) :- reachable(F, X, D), edge(X, T)

RA procedure with while:

    reachable += edge ⋈ literal1
    while change {
        reachable += Π_{F,T,D}(ρ_{F→X}(edge) ⋈ ρ_{T→X, D→D0}(reachable) ⋈ add1)
    }

Note: literal1(F,1) and add1(D0,D) are actually arithmetic and literal functions, modeled here as relations.
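
The while-loop procedure can be sketched in Python as follows. The function name `reachable_with_distance` and the optional `max_distance` cap are assumptions of this sketch: because the second rule invents a new distance value on every pass, the loop would never stop on a graph with a cycle, so a cap is needed there. On an acyclic graph the loop stops on its own.

```python
def reachable_with_distance(edges, max_distance=None):
    """Inflationary while-loop for:
         reachable(F, T, 1)   :- edge(F, T)
         reachable(F, T, D+1) :- reachable(F, X, D), edge(X, T)
    max_distance is an assumed safety cap (needed only for cyclic graphs).
    """
    reachable = {(f, t, 1) for (f, t) in edges}        # edge joined with the literal 1
    changed = True
    while changed:
        # Join reachable(F, X, D0) with edge(X, T) and add 1 to the distance.
        new = {(f, t, d0 + 1) for (f, x, d0) in reachable
               for (x2, t) in edges if x == x2}
        if max_distance is not None:
            new = {fact for fact in new if fact[2] <= max_distance}
        changed = not new <= reachable                  # did this pass add anything?
        reachable |= new                                # inflationary: facts are only added
    return reachable

edges = {(1, 2), (2, 3), (3, 4)}
facts = reachable_with_distance(edges)
print(sorted(facts))
```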

A Special Type of Query: Conjunctive Queries

A single Datalog rule with no disjunction (“∨”), negation (“¬”), or universal quantification (“∀”) can express select, project, and join: a conjunctive query

Conjunctive queries are possible to reason about statically

(Note that we can write CQ’s in other languages, e.g., SQL!)

We know how to “minimize” conjunctive queries
  - An important simplification that can’t be done for general SQL

We can test whether one conjunctive query’s answers always contain another conjunctive query’s answers (for ANY instance)

Why might this be useful?

Example of Containment

Suppose we have two queries:

    q1(S,C) :- Student(S,N), Takes(S,C), Course(C,X), inCIS(C), Course(C,'DB & Info Systems')
    q2(S,C) :- Student(S,N), Takes(S,C), Course(C,X)

- Intuitively, q1 must contain the same or fewer answers vs. q2:
  - It has all of the same conditions, plus extra conjuncts (i.e., it’s more restricted)
  - There’s no union or any other way it can add more data
- We can say that q2 contains q1 because this holds for any instance of our DB {Student, Takes, Course}
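
For conjunctive queries without interpreted predicates or negation, containment can be tested statically by searching for a containment mapping: q2 contains q1 exactly when the variables of q2 can be mapped to terms of q1 so that the head of q2 becomes the head of q1 and every subgoal of q2 becomes some subgoal of q1. The slides do not spell out this test; the Python sketch below is one standard way to do it. The function name `contains`, the tuple encoding, and the convention that constants are written with their quotes (so they are not mistaken for variables) are assumptions of this illustration.

```python
def is_var(term):
    # Assumed convention: variables start with an uppercase letter;
    # constants keep their quotes, e.g. "'DB & Info Systems'".
    return isinstance(term, str) and term[:1].isupper()

def extend(mapping, terms_from, terms_to):
    """Extend the mapping so that terms_from maps onto terms_to, or return None."""
    if len(terms_from) != len(terms_to):
        return None
    mapping = dict(mapping)
    for a, b in zip(terms_from, terms_to):
        if is_var(a):
            if mapping.setdefault(a, b) != b:   # variable already mapped elsewhere
                return None
        elif a != b:                            # constants must match exactly
            return None
    return mapping

def contains(q_big, q_small):
    """True if q_big contains q_small, i.e. a containment mapping q_big -> q_small exists."""
    head_big, body_big = q_big
    head_small, body_small = q_small

    def search(i, mapping):
        if i == len(body_big):
            return True
        rel, terms = body_big[i]
        for rel2, terms2 in body_small:
            if rel == rel2:
                extended = extend(mapping, terms, terms2)
                if extended is not None and search(i + 1, extended):
                    return True
        return False

    start = extend({}, head_big, head_small)
    return start is not None and search(0, start)

q1 = (("S", "C"), [("Student", ("S", "N")), ("Takes", ("S", "C")),
                   ("Course", ("C", "X")), ("inCIS", ("C",)),
                   ("Course", ("C", "'DB & Info Systems'"))])
q2 = (("S", "C"), [("Student", ("S", "N")), ("Takes", ("S", "C")),
                   ("Course", ("C", "X"))])

print(contains(q2, q1), contains(q1, q2))
```

Here q2 contains q1 via the identity mapping, while the reverse fails because q2 has no inCIS subgoal for q1's inCIS(C) to map to.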

Agenda

- Terminology
- Conjunctive queries and Datalog
- Virtual Data Integration Architecture

Building a Data Integration System

- Create a middleware mediator or data integration system over the sources
  - Can be warehoused (a data warehouse) or virtual
  - Presents a uniform query interface and schema
  - Abstracts away the multitude of sources; consults them for relevant data
- Unifies different source data formats (and possibly schemas)
- Sources are generally autonomous, not designed to be integrated
  - Sources may be local DBs or remote web sources/services
  - Sources may require certain input to return output (e.g., web forms); binding patterns describe these

Logical components of a virtual integration system

- Wrappers: programs whose role is to send queries to a data source, receive answers, and possibly apply some basic transformation to the answer
- Mediated schema: built for the data integration application; it contains only the aspects of the domain relevant to the application. Most probably it will contain a subset of the attributes seen in the sources
- Source descriptions: specify the properties of the sources that the system needs to know to use their data
  - The main component is the semantic mappings, which specify how attributes in the sources correspond to attributes in the mediated schema and how to resolve differences in how data values are specified in different sources
  - Other information includes whether the sources are complete

Example: a data integration scenario

Semantic mappings

Semantic mappings in the source descriptions describe:
- The relationship between the sources and the mediated schema, for example:
  - The mapping of source S1 states that it contains movies; the attribute name in Movies maps to the attribute title in the mediator Movie relation
  - The Actors relation in the mediated schema is a projection of the Movies source on the attributes name and actors
- Whether the sources are complete, for example:
  - Source S2 may not contain all the movie showing times in the entire country
- Limited access patterns to the sources, for example:
  - All the playing-time sources require a movie title as input

Components of a data integration system

Query reformulation (1)

- Rewrite the user query, which was posed in terms of the relations in the mediated schema, into queries referring to the schemas of the data sources
- The result is called a logical query plan:
  - A set of queries that refer to the schemata of the data sources and whose combination will yield the answer to the original query

Query reformulation (2)

- Ex:
    SELECT title, startTime
    FROM Movie, Plays
    WHERE Movie.title = Plays.movie AND location = 'New York' AND director = 'Woody Allen'
- Tuples for Movie can be obtained from source S1, but attribute title needs to be reformulated to name
- Tuples for Plays can be obtained from S2 or S3. Since S3 is complete for showings in NY, we choose it
- Since source S3 requires the title of a movie as input and the title is not specified in the query, the query plan must first access S1 and then feed the movie titles returned from S1 as inputs to S3
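
The plan described above, in which S1 is accessed first and its titles are fed into S3, can be sketched in Python. Everything concrete here is hypothetical: the in-memory data, the function names `query_s1`, `query_s3`, and `run_plan`, and the exact attributes of each source stand in for real wrappers and are only meant to show the shape of the plan.

```python
# Hypothetical stand-ins for the sources; the rows are made up for illustration.
S1_MOVIES = [
    {"name": "Manhattan", "director": "Woody Allen"},
    {"name": "Annie Hall", "director": "Woody Allen"},
    {"name": "Alien", "director": "Ridley Scott"},
]
S3_SHOWINGS = {
    "Manhattan": [{"location": "New York", "startTime": "19:00"}],
    "Alien": [{"location": "New York", "startTime": "21:30"}],
}

def query_s1(director):
    """Source S1: movies by a director; its attribute 'name' plays the role of 'title'."""
    return [m["name"] for m in S1_MOVIES if m["director"] == director]

def query_s3(title):
    """Source S3: can only be called with a movie title (its access pattern)."""
    return S3_SHOWINGS.get(title, [])

def run_plan(director, location):
    """Access S1 first, then feed each returned title into S3 and filter by location."""
    results = []
    for title in query_s1(director):
        for showing in query_s3(title):
            if showing["location"] == location:
                results.append((title, showing["startTime"]))
    return results

answer = run_plan("Woody Allen", "New York")
print(answer)
```

Calling S3 once per title corresponds to streaming S1's tuples into S3; collecting the titles first and sending them together would be the batched alternative.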

Query optimization

- Accepts a logical query plan as input and produces a physical query plan
  - Specifies the exact order in which sources are accessed, when results are combined, and which algorithms are used for performing operations on the data
- Ex: The optimizer will decide the join algorithm to combine results from S1 and S3. It may stream tuples from S1 and input them into S3, or it may batch them up before sending them to S3

Query execution

- Responsible for the execution of the physical query plan
  - Dispatches the queries to the individual sources through the wrappers and combines the results as specified by the query plan
- May also ask the optimizer to reconsider its plan based on its monitoring of the plan’s progress (e.g., if source S3 is slow)

References

- Chapter 2, draft of the book “Principles of Data Integration” by AnHai Doan, Alon Halevy, Zachary Ives (in preparation)
- Slides of the course CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives
- Raghu Ramakrishnan and Johannes Gehrke, “Database Management Systems”, 3rd edition, McGraw-Hill
- A. Silberschatz et al., “Database System Concepts”, 5th edition, McGraw-Hill
- S. Abiteboul et al., “Foundations of Databases”, Addison-Wesley, 1995