oblivious querying of data with irregular structure
Post on 08-Jan-2016
1
Oblivious Querying of Data with Irregular Structure
Yaron Kanza
The Rachel and Selim Benin School of Engineering and Computer Science
The Hebrew University of Jerusalem
2
Joint Work With
• Queries with Incomplete Answers
  – Werner Nutt, Shuky Sagiv
• Flexible Queries
  – Shuky Sagiv
• SQL4X
  – Sara Cohen, Shuky Sagiv
• Computing Full Disjunctions
  – Shuky Sagiv
3
Agenda
Why is it difficult to query semistructured data?
Queries with incomplete answers (QwIA)
Flexible queries (FQ)
Oblivious querying = QwIA + FQ
Using QwIA and FQ for information integration
4
Agenda
Why is it difficult to query semistructured data?
Queries with incomplete answers (QwIA)
Flexible queries (FQ)
Oblivious querying = QwIA + FQ
Using QwIA and FQ for information integration
5
The Semistructured Data Model
• Data is described as a rooted labeled directed graph
• Nodes represent objects
• Edges represent relationships between objects
• Atomic values are attached to atomic nodes
6
A Movie Database Example
[Figure: a rooted labeled directed graph with root node 1 (Movie Database). Edge labels: Movie, Film, T.V. Series, Actor, Title, Name, Year. Atomic values: Star Wars, 1977, Dune, 1984, Twin Peaks, Léon, Magnolia, Mark Hamill, Harrison Ford, Kyle MacLachlan, Natalie Portman.]
7
XML that Encodes the Semistructured Data
<?xml version="1.0"?>
<MDB>
  <Movie>
    <Title>Star Wars</Title>
    <Year>1977</Year>
    <Actor>
      <Name>Mark Hamill</Name>
    </Actor>
    <Actor>
      <Name>Harrison Ford</Name>
    </Actor>
  </Movie>
  …
</MDB>
8
Consider a Query that Requests Movies, Actors that Acted in the Movies and the Movies’ Year of Release
[Figure: the movie database graph from the Movie Database Example.]
What Should be the Form of the Query?
9
Incomplete Data
[Figure: the movie database graph, annotated with the two observations below.]
• The movie has a year attribute
• The year of the movie is missing
10
Variations in Structure
[Figure: the movie database graph, annotated with the two cases below.]
• Movie below actor (nodes 14 and 29)
• Actor below movie (nodes 11 and 21)
11
Ontology Variations
[Figure: the movie database graph, annotated with "A movie label" and "A film label".]
Dealing with ontology variations is beyond the scope of this talk
12
Irregular Data
• Data is incomplete
  – Missing values of attributes in objects
• Data has structural variations
  – Relationships between objects are represented differently in different parts of the database
• Data has ontology variations
  – Different labels are used to describe objects of the same type
13
Irregular data does not conform to a strict schema
Queries over irregular data should not be rigid patterns
The schema cannot guide a user in formulating a query
14
In Which Cases is it Difficult to Formulate Queries over Semistructured Data?
• The description of the schema is large (e.g., a DTD of XML)
  – It is difficult to use the schema when formulating queries
• Data is contributed by many users in a variety of designs
  – The query should deal with different structures of data
• The structure of the database is changed frequently
  – Queries should be rewritten frequently
15
Can Regular Expressions Help in Querying Irregular Data?
• In many cases, regular expressions can be used to query irregular data
• Yet, regular expressions are
  – Not efficient – it is difficult to evaluate regular expressions
  – Not intuitive – it is difficult for a naïve user to formulate regular expressions
16
More on Using Regular Expressions
• When querying irregular data, the size of the regular expression could be exponential in the number of labels in the database
  – For n types of objects, there are n! possible hierarchies
  – For an object with n attributes, there are 2^n subsets of missing attributes
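The two counts on this slide can be checked numerically. The following is a minimal Python sketch; the type and attribute names are placeholders chosen for illustration, not taken from the talk.

```python
from itertools import combinations, permutations
from math import factorial

def hierarchies(types):
    """All top-to-bottom orderings of the given object types."""
    return list(permutations(types))

def missing_attribute_subsets(attributes):
    """All subsets of attributes that may be absent from an object."""
    return [set(c)
            for k in range(len(attributes) + 1)
            for c in combinations(attributes, k)]

# Placeholder names (assumptions for illustration only).
types = ["Movie", "Actor", "Director"]
attrs = ["Title", "Year", "Language"]

# n! hierarchies and 2^n subsets, for n = 3.
print(len(hierarchies(types)), factorial(len(types)))
print(len(missing_attribute_subsets(attrs)), 2 ** len(attrs))
```

A regular expression that anticipates every ordering and every subset of missing attributes has to enumerate these cases, which is where the blow-up comes from.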
17
Agenda
Why is it difficult to query semistructured data?
Queries with incomplete answers (QwIA)
Flexible queries (FQ)
Oblivious querying = QwIA + FQ
Using QwIA and FQ for information integration
18
Queries with Incomplete Answers
• We have developed queries that deal with incomplete data in a novel way and return incomplete answers
• The queries return maximal answers rather than complete answers
• Different query semantics admit different levels of incompleteness
19
Queries with Incomplete Answers
Increasing level of incompleteness:
• Queries with complete answers
• Queries with AND Semantics
• Queries with Weak Semantics
• Queries with OR Semantics
20
Queries and Matchings
• The queries are labeled rooted directed graphs
• Query nodes are variables
• Matchings are assignments of database objects to the query variables according to
  – the constraints specified in the query, and
  – the semantics of the query
21
Constraints On Complete Matchings
• Root Constraint:
  – Satisfied if the query root is mapped to the db root
• Edge Constraint:
  – Satisfied if a query edge with label l is mapped to a database edge with label l
[Figure: the query root r is mapped to the database root 1; a query edge from x to y with label l is mapped to a database edge from node 12 to node 25 with label l.]
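The two constraints can be sketched as a small check. This Python sketch assumes a representation that is not part of the talk: the database and the query are sets of (source, label, target) triples, and a matching is a dict from query variables to object ids. The toy data is hypothetical.

```python
def is_complete_matching(query_edges, query_root, db_edges, db_root, m):
    """Check the root constraint and every edge constraint for a total assignment m."""
    variables = {query_root} | {v for s, _, t in query_edges for v in (s, t)}
    # A complete matching maps every query variable to a non-null object.
    if any(m.get(v) is None for v in variables):
        return False
    # Root constraint: the query root is mapped to the database root.
    if m[query_root] != db_root:
        return False
    # Edge constraints: each query edge is mapped to a database edge with the same label.
    return all((m[s], label, m[t]) in db_edges for s, label, t in query_edges)

# Hypothetical toy database and query (assumptions for illustration).
db_edges = {(1, "Movie", 11), (11, "Actor", 21), (11, "Year", 24),
            (1, "Movie", 12), (12, "Actor", 25)}
query_edges = {("r", "Movie", "x"), ("x", "Actor", "y"), ("x", "Year", "z")}

ok = {"r": 1, "x": 11, "y": 21, "z": 24}
bad = {"r": 1, "x": 12, "y": 25, "z": 24}  # (12, "Year", 24) is not a database edge

print(is_complete_matching(query_edges, "r", db_edges, 1, ok))
print(is_complete_matching(query_edges, "r", db_edges, 1, bad))
```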
22
A Complete Matching
[Figure: a movie database graph (edge labels Movie, Actor, Uncredited Actor, Director, Producer, Title, Year, Name, Date of birth) and a query graph with variables r, x, y, z, u, v. The query variables are mapped to database nodes 1, 12, 27, 32, 11 and 35.]
• All the nodes are mapped to non-null values
• The root constraint and all the edge constraints are satisfied
23
[Figure: the same database and query graphs as in the complete matching example.]
Consider the case where Node 35 (Date of birth: 14 May 1944) is removed from the database
No Complete Matching Exists!
24
Not Every Partial Assignment is an Incomplete Matching
[Figure: the same database and query graphs, with a partial assignment that maps r to node 1, maps v to node 31, and maps x, y, z and u to NULL.]
This is not a matching, since the sequence of labels from the database root to Node 31 is different from any sequence of labels that starts at the query root and ends in variable v
25
The Reachability Constraint on Partial Matchings
• A query node v that is mapped to a database object o satisfies the reachability constraint if there is a path from the query root to v, such that all edge constraints along this path are satisfied
[Figure: a query graph (variables r, w, x, y, z, v; edge labels l1 to l6) and a database graph illustrating the constraint.]
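The reachability constraint can be sketched as a graph search that only follows satisfied query edges. The representation (triples for edges, a dict with None for null) and the toy data are assumptions for illustration.

```python
def reachable_variables(query_edges, query_root, db_edges, m):
    """Variables reachable from the query root along satisfied query edges."""
    def satisfied(s, label, t):
        return (m.get(s) is not None and m.get(t) is not None
                and (m[s], label, m[t]) in db_edges)

    seen, stack = {query_root}, [query_root]
    while stack:
        v = stack.pop()
        for s, label, t in query_edges:
            if s == v and t not in seen and satisfied(s, label, t):
                seen.add(t)
                stack.append(t)
    return seen

def satisfies_reachability(query_edges, query_root, db_edges, m):
    """True if every variable mapped to a database object is reachable."""
    mapped = {v for v, o in m.items() if o is not None}
    return mapped <= reachable_variables(query_edges, query_root, db_edges, m)

# Hypothetical toy data: two query paths lead from r to v.
query_edges = {("r", "l1", "w"), ("w", "l3", "v"), ("r", "l2", "x"), ("x", "l4", "v")}
db_edges = {(1, "l1", 5), (5, "l3", 8)}

via_w = {"r": 1, "w": 5, "x": None, "v": 8}        # v is reached through w
dangling = {"r": 1, "w": None, "x": None, "v": 8}  # no satisfied path reaches v

print(satisfies_reachability(query_edges, "r", db_edges, via_w))
print(satisfies_reachability(query_edges, "r", db_edges, dangling))
```

One satisfied path is enough: in `via_w` the path through x is broken, yet v still satisfies the constraint.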
26
“And” Matchings
• A partial matching is an AND matching if
  – The root constraint is satisfied
  – The reachability constraint is satisfied by every query node that is mapped to a database node
  – If a query node is mapped to a database node, all the incoming edge constraints are satisfied
[Figure: a query fragment with variables r, x, y, z and edge labels Director, Actor, Producer.]
27
An AND Matching
[Figure: the movie database and query graphs. The query variables are mapped to database nodes 1, 12, 27, 32 and 11; variable u is mapped to NULL.]
28
[Figure: the movie database and query graphs.]
Suppose that we remove the edges that are labeled with Uncredited Actor
In an AND matching, Node z must be null!
29
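Weak Satisfaction of Edge Constraints
• Edge Constraint:
  – Is Weakly Satisfied if it is either
    – Satisfied (as defined earlier), or
    – One (or more) of its nodes is mapped to a null value
[Figure: four cases for a query edge from x to y with label l. In the first, x and y are mapped to nodes 12 and 25, which are connected by an edge labeled l. In the other three, at least one of x and y is mapped to null.]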
30
Weak Matchings
• A partial matching is a weak matching if
  – The root constraint is satisfied
  – The reachability constraint is satisfied by every query node that is mapped to a database node
  – Every edge constraint is weakly satisfied
31
A Weak Matching
[Figure: the movie database and query graphs. The query variables are mapped to database nodes 1, 27, 32 and 11; variables u and y are mapped to NULL. The edges that are weakly satisfied are marked.]
32
[Figure: the four cases for a query edge from x to y with label l: the edge is satisfied, or at least one of x and y is mapped to null.]
In a weak matching, all four options are permitted
In an AND matching, only the first three options are permitted
33
[Figure: the movie database and query graphs.]
Consider the case where edges labeled with Producer are removed
In a weak matching, Node z must be null!
34
“OR” Matchings
• A partial matching is an OR matching if
  – The root constraint is satisfied
  – The reachability constraint is satisfied by every query node that is mapped to a database node
35
An OR Matching
[Figure: the movie database and query graphs. The query variables are mapped to database nodes 1, 27, 32 and 11; variables u and y are mapped to NULL. One edge is marked as an edge which is not weakly satisfied.]
36
Increasing Level of Incompleteness
• A complete matching is an AND matching
• An AND matching is a weak matching
• A weak matching is an OR matching
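The hierarchy can be illustrated by classifying a partial assignment into the strictest semantics it satisfies. This is a minimal Python sketch of the definitions given on the earlier slides; the triple-based representation and the toy data are assumptions for illustration.

```python
def satisfied(edge, db_edges, m):
    s, label, t = edge
    return (m.get(s) is not None and m.get(t) is not None
            and (m[s], label, m[t]) in db_edges)

def weakly_satisfied(edge, db_edges, m):
    s, _, t = edge
    return m.get(s) is None or m.get(t) is None or satisfied(edge, db_edges, m)

def reachable(query_edges, root, db_edges, m):
    """Variables reachable from the query root along satisfied edges."""
    seen, stack = {root}, [root]
    while stack:
        v = stack.pop()
        for e in query_edges:
            if e[0] == v and e[2] not in seen and satisfied(e, db_edges, m):
                seen.add(e[2])
                stack.append(e[2])
    return seen

def classify(query_edges, root, db_edges, db_root, m):
    """Return 'complete', 'AND', 'weak', 'OR', or None if m is not a matching."""
    variables = {root} | {v for s, _, t in query_edges for v in (s, t)}
    mapped = {v for v in variables if m.get(v) is not None}
    # Root and reachability constraints are required by all semantics.
    if m.get(root) != db_root or not mapped <= reachable(query_edges, root, db_edges, m):
        return None
    if mapped == variables and all(satisfied(e, db_edges, m) for e in query_edges):
        return "complete"
    # AND: every edge entering a mapped node is satisfied.
    if all(satisfied(e, db_edges, m) for e in query_edges if e[2] in mapped):
        return "AND"
    if all(weakly_satisfied(e, db_edges, m) for e in query_edges):
        return "weak"
    return "OR"

# Hypothetical toy data: a DAG query in which z has two incoming edges.
query_edges = {("r", "a", "x"), ("r", "b", "y"), ("x", "c", "z"), ("y", "d", "z")}
db_edges = {(1, "a", 2), (1, "b", 3), (2, "c", 4), (3, "d", 4), (3, "d", 5)}

examples = {
    "complete": {"r": 1, "x": 2, "y": 3, "z": 4},
    "AND": {"r": 1, "x": 2, "y": 3, "z": None},
    "weak": {"r": 1, "x": 2, "y": None, "z": 4},
    "OR": {"r": 1, "x": 2, "y": 3, "z": 5},
}
for expected, m in examples.items():
    print(expected, classify(query_edges, "r", db_edges, 1, m))
```

Because `classify` tests the semantics from strictest to most permissive, an assignment reported as "AND" also passes the weak and OR tests, which mirrors the three bullets above.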
37
Maximal Matchings
• Matchings are represented as tuples of oid’s and null values
• A tuple t1 subsumes a tuple t2 if t1 is the result of replacing some null values in t2 by non-null values:
    t1 = (1, 5, 2, null)
    t2 = (1, null, 2, null)
• A matching is maximal if no other matching subsumes it
• A query result consists of maximal matchings only
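Subsumption and maximality can be sketched directly on tuples. This Python sketch uses None for null and reads "replacing some null values" as replacing at least one; the third tuple in the example is an added illustration, not from the slide.

```python
def subsumes(t1, t2):
    """t1 subsumes t2 if t1 is obtained from t2 by filling in at least one null."""
    return (len(t1) == len(t2) and t1 != t2
            and all(b is None or a == b for a, b in zip(t1, t2)))

def maximal(matchings):
    """Keep only the matchings that no other matching subsumes."""
    ms = set(matchings)
    return {t for t in ms if not any(subsumes(u, t) for u in ms)}

t1 = (1, 5, 2, None)
t2 = (1, None, 2, None)
t3 = (1, None, 3, None)  # extra tuple, added for illustration

print(subsumes(t1, t2))
print(maximal([t1, t2, t3]))
```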
38
On the Complexity of Computing Queries with Incomplete Answers
• The size of the result can be exponential in the size of the input (database and query)
  – Note that the same is true when joining relations – the size of the result can be exponential in the size of the input (database and query)
• Instead of using data complexity (where the runtime depends only on the size of the database), we use input-output complexity
39
Input-Output Complexity
In input-output complexity, the time complexity is a function of the size of the query, the size of the database, and the size of the result.
40
The Motivation for Using I/O Complexity
• Measuring the time complexity with respect to the size of the input does not separate between the following two cases:
  – An algorithm that does an exponential amount of work simply because the size of the output is exponential in the size of the input
  – An algorithm that does an exponential amount of work even when the query result is small
• Either the algorithm is naïve (e.g., it unnecessarily computes subsumed matchings) or the problem is hard
41
I/O Complexity of Query Evaluation (lower bounds are for non-emptiness)

Query / Semantics   Path Query   Tree Query   DAG Query     Cyclic Query
Complete            PTIME        PTIME        NP-Complete   NP-Complete
AND                 PTIME        PTIME        PTIME         NP-Complete
Weak                PTIME        PTIME        PTIME         PTIME
OR                  PTIME        PTIME        PTIME         PTIME

Recent Results (PODS’03)
42
Filter Constraints
• Constraints that filter the results (i.e., the maximal matchings)
• There are
  – Weak filter constraints (the constraint is satisfied if a variable in the constraint is null)
  – Strong filter constraints (all variables must be non-null for satisfaction)
• Existence constraint: !x is true if x is not null
43
I/O Complexity of Query Evaluation with Existence Constraints (lower bounds are for non-emptiness)

Query / Semantics   Path Query   Tree Query   DAG Query     Cyclic Query
Complete            PTIME        PTIME        NP-Complete   NP-Complete
AND                 PTIME        PTIME        NP-Complete   NP-Complete
Weak                PTIME        PTIME        NP-Complete   NP-Complete
OR                  PTIME        PTIME        NP-Complete   NP-Complete
44
I/O Complexity of Query Evaluation with Weak Equality/Inequality Constraints (lower bounds are for non-emptiness)

Query / Semantics   Path Query   Tree Query    DAG Query     Cyclic Query
Strong              PTIME        NP-Complete   NP-Complete   NP-Complete
AND                 PTIME        NP-Complete   NP-Complete   NP-Complete
Weak                PTIME        NP-Complete   NP-Complete   NP-Complete
OR                  PTIME        NP-Complete   NP-Complete   NP-Complete
45
Query Containment
• Query containment for queries with incomplete answers is defined differently from query containment for queries with complete answers
• Q1 ⊆ Q2 if for all databases D, every matching of Q1 w.r.t. D is subsumed by a matching of Q2 w.r.t. D
• Query containment (query equivalence) is useful for the development of optimization techniques
46
Containment in AND Semantics
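• Homomorphism between the query graphs is necessary and sufficient for containment
• Deciding whether one query is contained in another is NP-Complete
[Figure: two query graphs, Q1 (variables r, x, y, z, u, v) and Q2 (variables r, p, q, u, v), with edge labels l1 to l4, and a homomorphism between them that witnesses a containment between Q1 and Q2.]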
47
Containment in OR Semantics
• The following is a necessary and sufficient condition for query containment in OR semantics
• For every spanning tree T1 of the contained query, there is a spanning tree T2 of the containing query, such that there is a homomorphism from T2 to T1
• Deciding containment is
  – in Π^P_2
  – NP-Complete if the containee is a tree
  – polynomial if the container is a tree
48
Containment in Weak Semantics
• Similar to containment in OR Semantics, with the following difference
• Instead of checking homomorphism between spanning trees, we check homomorphism between graph fragments
  – A graph fragment is a restriction of the query to a subset of the variables that includes the query root such that every node in the fragment is reachable from the root
49
Agenda
Why is it difficult to query semistructured data?
Queries with incomplete answers (QwIA)
Flexible queries (FQ)
Oblivious querying = QwIA + FQ
Using QwIA and FQ for information integration
50
Flexible Queries
• To deal with structural variations in the data, we have developed flexible queries
51
Flexible Queries
Increasing level of flexibility:
• Rigid Queries
• Semiflexible Queries
• Flexible Queries
52
Movie Database Example
[Figure: a query graph with variables r, x, y, z and edge labels Actor, Actor, Movie, Movie.]
• A query that finds all pairs of actors that acted in the same movie
• However, if actors are descendants of movies in the database, the query has to be reformulated
• Instead, we propose new ways of matching queries to databases
53
Rigid matchings and complete matchings are the same
Returning rigid matchings is the usual semantics for queries (e.g., XQuery, Lorel, XML-QL, etc.)
54
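Constraints On Rigid Matchings
• Root Constraint:
  – Satisfied if the query root is mapped to the db root
• Edge Constraint:
  – Satisfied if a query edge with label l is mapped to a database edge with label l
[Figure: the query root r is mapped to the database root 1; a query edge from x to y with label l is mapped to a database edge from node 12 to node 25 with label l.]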
55
[Figure: the movie database graph and a query graph with variables r, x, y and edge labels Actor and Movie. The assignment that uses database nodes 1, 14 and 29 is marked "A Rigid Matching"; the assignment that uses database nodes 1, 25 and 12 is marked "This is not a Rigid Matching".]
56
A Semiflexible Matching
• The query root is mapped to the db root
• A query node with an incoming label l is mapped to a db node with an incoming label l
• The image of every query path is embedded in some database path
• SCC is mapped to SCC
[Figure: the query root r is mapped to the db root 1; a query edge from x to y with label l is mapped to db nodes 9 and 11.]
57
A Semiflexible Matching
• The query root is mapped to the db root
• A query node with an incoming label l is mapped to a db node with an incoming label l
• The image of every query path is embedded in some database path
• SCC is mapped to SCC
[Figure: the query root r is mapped to the db root 1; a query edge from x to y with label l is mapped to db nodes 9 and 11.]
The last two conditions cannot be verified locally, i.e., by considering one query edge at a time
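For the special case of an acyclic query and a tree database, the path-embedding condition can be sketched as follows. This is one reading of the condition, not the authors' algorithm: in a tree, a set of nodes lies on one path exactly when its members are pairwise in an ancestor/descendant relation, so it is enough to compare the images of every pair of query variables where one is reachable from the other. The SCC condition is vacuous here because neither graph has cycles. The representation and the toy data are assumptions for illustration.

```python
def ancestors(parent, node):
    """Proper ancestors of node in a tree given as child -> (parent, label)."""
    out = set()
    while node in parent:
        node = parent[node][0]
        out.add(node)
    return out

def on_one_path(parent, a, b):
    return a == b or a in ancestors(parent, b) or b in ancestors(parent, a)

def query_descendants(query_edges, v):
    seen, stack = set(), [v]
    while stack:
        u = stack.pop()
        for s, _, t in query_edges:
            if s == u and t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def is_semiflexible(query_edges, q_root, parent, db_root, m):
    """m is a total assignment of query variables to tree nodes."""
    if m[q_root] != db_root:
        return False
    # Label condition: incoming labels must agree.
    for _, label, t in query_edges:
        if m[t] not in parent or parent[m[t]][1] != label:
            return False
    # Path condition: images of variables on a common query path are pairwise comparable.
    variables = {q_root} | {v for s, _, t in query_edges for v in (s, t)}
    return all(on_one_path(parent, m[u], m[v])
               for u in variables
               for v in query_descendants(query_edges, u))

# Hypothetical tree database: actors below movie 11, and movie 29 below actor 14.
parent = {11: (1, "Movie"), 21: (11, "Actor"), 22: (11, "Actor"),
          14: (1, "Actor"), 29: (14, "Movie")}
query_edges = {("r", "Movie", "x"), ("x", "Actor", "y")}

print(is_semiflexible(query_edges, "r", parent, 1, {"r": 1, "x": 11, "y": 21}))
print(is_semiflexible(query_edges, "r", parent, 1, {"r": 1, "x": 29, "y": 14}))
print(is_semiflexible(query_edges, "r", parent, 1, {"r": 1, "x": 29, "y": 21}))
```

The second assignment is accepted even though the actor sits above the movie in the database, which is the structural variation that rigid matchings cannot absorb.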
58
The Semiflexible Matchings
[Figure: the movie database graph and a query graph with variables r, x, y and edge labels Actor and Movie. The matchings shown use the database node triples 1-25-12, 1-14-29, 1-22-11 and 1-21-11.]
We get all the actor-movie pairs
59
[Figure: two query graphs over the variables r, x, y with edge labels Actor and Movie; the two graphs differ in which of x and y is placed above the other.]
Under semiflexible semantics, these two queries are equivalent
The user does not have to know if movies are above or below actors in the database
60
Another Example of a Semiflexible Matching
[Figure: the movie database graph and a query graph with variables r, x, y, z and edge labels Actor, Movie, Movie, Actor. The matchings shown use database nodes 1, 11, 21 and 22.]
We get pairs of actors that acted in the same movie
Impossible to get this pair by means of a rigid matching, since the query is a dag and the db is a tree
61
A Flexible Matching
• The query root is mapped to the db root
• A query node with an incoming label l is mapped to a db node with an incoming label l
• An edge is mapped to two nodes on one path
• Notice that a path in the query is not necessarily mapped to a path in the db
[Figure: the query root r is mapped to the db root 1; a query edge from x to y with label l is mapped to db nodes 9 and 11.]
62
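An Example of a Flexible Query
[Figure: a query graph rooted at r, with the following variables.]
• x (incoming label Director): A director
• y (incoming label Name): The director name
• z (incoming label Movie): A movie of the director
• v (incoming label Title): The title of the movie
• u (incoming label Actor): An actor in the movie
• w (incoming label Name): The name of the actor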
63
[Figure: the movie database graph and the flexible query with variables r, x, y, z, v, u, w. The matching shown uses database nodes 1, 29, 12, 34, 26, 33 and 25.]
A query edge is mapped to two db nodes on one path
This flexible matching is neither a rigid matching nor a semiflexible matching
64
Why are semiflexible matchings preferred sometimes to flexible matchings?
[Figure: the movie database graph and a query graph with variables r, x, y and edge labels Movie and Producer. The matching shown uses database nodes 1, 27 and 11.]
In this flexible matching, a producer is given with a movie that he directed but did not produce
65
[Figure: the movie database graph and the query graph with variables r, x, y and edge labels Movie and Producer, with database nodes 1, 99 and 11 marked.]
In semiflexible semantics, the problem is solved since the image of a query path is embedded in a database path
66
Differences Between the Semiflexible and Flexible Semantics
• On a technical level, in flexible matchings
  – Query paths are not necessarily embedded in database paths
  – SCC’s are not necessarily mapped to SCC’s
• On a conceptual level, in the semiflexible semantics, nodes are “semantically related” if they are on the same path, and hence
  – Query paths are embedded in database paths
• In the flexible semantics, this condition is relaxed:
  – Query edges are embedded in database paths
67
Increasing Level of Flexibility
• A rigid matching is a semiflexible matching
• A semiflexible matching is a flexible matching
68
Verifying that Mappings are Semiflexible Matchings
• Is a given mapping of query nodes to database nodes a semiflexible matching?
  – Not as simple as for rigid matchings (no local test, i.e., need to consider paths rather than edges)
• In a dag query, the number of paths may be exponential
  – Yet, verifying is in polynomial time
• In a cyclic query, the number of paths may be infinite
  – Yet, verifying is in exponential time
69
Verifying that a Mapping is a Semiflexible Matching
Query / Database   Path Query   Tree Query   DAG Query   Cyclic Query
Path Database      PTIME        PTIME        PTIME       No matchings
Tree Database      PTIME        PTIME        PTIME       No matchings
DAG Database       PTIME        PTIME        PTIME       No matchings
Cyclic Database    PTIME        PTIME        coNP        coNP
70
Input-Output Complexity of Query Evaluation for the Semiflexible Semantics
• The next slide summarizes results about the input-output complexity
  – Polynomial for a dag query and a tree database (or simpler cases)
    • Rather difficult to prove, even when the query is a tree, since there is no local test for verifying that mappings are semiflexible matchings
  – Exponential lower bounds for other cases
71
I/O Complexity for SF Semantics (lower bounds are for non-emptiness)

Query / Database   Path Query    Tree Query    DAG Query         Cyclic Query
Path Database      PTIME         PTIME         PTIME             Result is empty
Tree Database      PTIME         PTIME         PTIME             Result is empty
DAG Database       NP-Complete   NP-Complete   NP-Complete       Result is empty
Cyclic Database    NP-Complete   NP-Complete   NP-Hard (in P2)   NP-Hard (in P2)

Data Complexity is Polynomial in all Cases
72
Query Evaluation for the Flexible Semantics
• The database is replaced with a relationship graph, which is a graph such that
  – The nodes are the nodes of the database
  – Two nodes are connected by an edge if there is a path between them in the database (the direction of the path is unimportant)
• The query is evaluated under rigid semantics w.r.t. the relationship graph
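The construction can be sketched as follows: compute the pairs of nodes linked by a directed path in either direction, then test each query edge against those pairs together with the incoming-label condition from the definition of flexible matchings. The representation and the toy data are assumptions for illustration, and the sketch only verifies a given assignment rather than enumerating all matchings.

```python
def descendants(db_edges, start):
    """Nodes reachable from start by a non-empty directed path."""
    seen, stack = set(), [start]
    while stack:
        u = stack.pop()
        for s, _, t in db_edges:
            if s == u and t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def relationship_graph(db_edges):
    """Unordered pairs {a, b} such that a directed path links a and b in some direction."""
    nodes = {n for s, _, t in db_edges for n in (s, t)}
    return {frozenset((a, b))
            for a in nodes for b in descendants(db_edges, a) if a != b}

def is_flexible_matching(query_edges, q_root, db_edges, db_root, m):
    """m is a total assignment of query variables to database nodes."""
    related = relationship_graph(db_edges)
    incoming = {(t, label) for _, label, t in db_edges}
    if m[q_root] != db_root:
        return False
    return all((m[t], label) in incoming and frozenset((m[s], m[t])) in related
               for s, label, t in query_edges)

# Hypothetical toy database: the director is stored below the movie.
db_edges = {(1, "Movie", 11), (11, "Title", 23), (11, "Director", 27),
            (27, "Name", 32), (1, "Movie", 12), (12, "Title", 26)}
# The query places the movie below the director.
query_edges = {("r", "Director", "x"), ("x", "Name", "y"),
               ("x", "Movie", "z"), ("z", "Title", "v")}

good = {"r": 1, "x": 27, "y": 32, "z": 11, "v": 23}
wrong_title = {"r": 1, "x": 27, "y": 32, "z": 11, "v": 26}  # title of another movie

print(is_flexible_matching(query_edges, "r", db_edges, 1, good))
print(is_flexible_matching(query_edges, "r", db_edges, 1, wrong_title))
```

In `good`, the query path x, z, v is mapped to nodes 27, 11 and 23, which do not lie on a single database path, so this assignment is a flexible matching but would not pass the semiflexible path condition.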
73
I/O Complexity of Query Evaluation for the Flexible Semantics
• Results follow from a reduction to query evaluation under the rigid semantics
• Tree query
  – Input-Output complexity is polynomial
• DAG query
  – Testing for non-emptiness is NP-Complete
74
Query Containment
• Q1 ⊆ Q2 if for all databases D, the set of matchings of Q1 w.r.t. D is contained in the set of matchings of Q2 w.r.t. D
• We assume that
  – Both queries have the same set of variables
75
Complexity of Query Containment
• Under the semiflexible semantics, Q1 ⊆ Q2 iff the identity mapping from the variables of Q2 to the variables of Q1 is a semiflexible matching of Q2 w.r.t. Q1
• Thus, containment is
  – in coNP when Q1 is a cyclic graph and Q2 is either a dag or a cyclic graph
  – in polynomial time in all other cases
• Under the flexible semantics, query containment is always in polynomial time
76
Database Equivalence
• D1 and D2 are equivalent if for all queries Q, the set of matchings of Q w.r.t. D1 is equal to the set of matchings of Q w.r.t. D2
• Both databases must have the same set of objects and the same root
77
Complexity of Database Equivalence
• For the semiflexible semantics, deciding equivalence of databases is
  – in polynomial time if both databases are dags
  – in coNP if one of the databases has cycles
• For the flexible semantics, deciding equivalence of databases is polynomial in all cases
78
Database Transformation
[Figure: two databases rooted at node 1 (MDB), with edge labels Actor and Movie, over the actors Dustin Hoffman, Harrison Ford and Mark Hamill and the movies Hook and Star Wars. The first database is a DAG; the second is a tree.]
The databases are equivalent under both the flexible and semiflexible semantics
A DAG has become a TREE!
79
Transforming a Database into a Tree
• Reasons for transforming a database into an equivalent tree database:
  – Evaluation of queries over a tree database is more efficient
  – In a graphical user interface, it is easier to represent trees than DAGs or cyclic graphs
  – Storing the data in a serial form (e.g., XML) requires no references
80
Transformation into a Tree
• There are algorithms for
  – Testing if a database can be transformed into an equivalent tree database, and
  – Performing the transformation
• For the semiflexible semantics
  – The algorithms are polynomial
• For the flexible semantics
  – The algorithms are exponential
81
Implementing Flexible Queries
• Flexible queries were implemented in SQL4X
• In an SQL4X query, relations and XML documents are queried simultaneously
• A query result can be either a relation or an XML document
82
An SQL4X Query
QUERY AS RELATION
SELECT text(y) as director, text(v) as title
FROM x Director of 'MDB.xml', y Name of x,
     z Movie of x, v Title of z
[Figure: the query graph with variables r, x, y, z, v and edge labels Director, Name, Movie, Title.]
A query under the Flexible Semantics
83
An SQL4X Query
QUERY AS RELATION
SELECT text(y) as director, text(v) as title
FROM x Director of 'MDB.xml', y Name of x,
     z Movie of x, v Title of x
WHERE text(v) = 'Star Wars'
[Figure: the query graph with variables r, x, y, z, v and edge labels Director, Name, Movie, Title.]
A query under the Flexible Semantics
Constraints can be added
84
An SQL4X Query
QUERY AS RELATION
SELECT text(x) as director, text(v) as title, Budget
FROM x Director of 'MDB.xml', y Name of x,
     z Movie of x, v Title of x, FilmBudgets
WHERE text(v) = FilmBudgets.Title
[Figure: the query graph with variables r, x, y, z, v and edge labels Director, Name, Movie, Title, next to the relation FilmBudgets with columns Title and Budget.]
A query under the Flexible Semantics
Relations and XML Documents can be queried simultaneously
FilmBudgets: a relation with data about film budgets
85
Agenda
Why is it difficult to query semistructured data?
Queries with incomplete answers (QwIA)
Flexible queries (FQ)
Oblivious querying = QwIA + FQ
Using QwIA and FQ for information integration
86
Combining the Paradigms
• In oblivious querying:
  – The user does not have to know where data is incomplete
  – The user does not have to know the exact structure of the data
• The paradigm of flexible queries and the paradigm of queries with incomplete answers should be combined
87
Flexible Queries with Incomplete Answers
• A flexible query w.r.t. a database is actually a rigid query w.r.t. the relationship graph
• Evaluating a query in AND-semantics (weak semantics, OR-Semantics) w.r.t. the relationship graph produces a flexible query that returns maximal answers rather than complete answers
88
[Figure: the movie database graph and the flexible query with variables r, x, y, z, v, u, w.]
Consider the case where Node 25 and Node 33 are removed
89
[Figure: the movie database graph without nodes 25 and 33, and the flexible query with variables r, x, y, z, v, u, w. The matching shown uses database nodes 1, 29, 12, 34 and 26; variables u and w are mapped to NULL.]
A Flexible matching which is also an incomplete (maximal) matching
90
Agenda
Why is it difficult to query semistructured data?
Queries with incomplete answers (QwIA)
Flexible queries (FQ)
Oblivious querying = QwIA + FQ
Using QwIA and FQ for information integration
91
Full Disjunction
• Intuitively, the full disjunction of a given set of relations is the join of these relations that does not discard dangling tuples
• Dangling tuples are padded with nulls
• Only maximal tuples are retained in the full disjunction (as in the case of QwIA)
92
Movies
m-id | title | year | language
1 | Zelig | 1983 | English
2 | Antz | 1998 | English
3 | Armageddon | 1998 | English
4 | Fantasia | 1940 | English

Actors
a-id | name | date-of-birth
1 | Woody Allen | 1/12/1935
2 | Bruce Willis | 19/3/1955
3 | Julia Roberts | 28/10/1967

Acted-in
a-id | m-id | role
1 | 1 | Zelig
1 | 2 | Z
2 | 3 | Harry

Actors-that-Directed
a-id | m-id
1 | 1

The Full Disjunction of the Given Relations
m-id | title | year | language | a-id | name | date-of-birth | role
1 | Zelig | 1983 | English | 1 | Woody Allen | 1/12/1935 | Zelig
2 | Antz | 1998 | English | 1 | Woody Allen | 1/12/1935 | Z
3 | Armageddon | 1998 | English | 2 | Bruce Willis | 19/3/1955 | Harry
4 | Fantasia | 1940 | English | NULL | NULL | NULL | NULL
NULL | NULL | NULL | NULL | 3 | Julia Roberts | 28/10/1967 | NULL
93
The Full Disjunction of the Given Relations
m-id | title | year | language | a-id | name | date-of-birth | role
1 | Zelig | 1983 | English | 1 | Woody Allen | 1/12/1935 | Zelig
2 | Antz | 1998 | English | 1 | Woody Allen | 1/12/1935 | Z
3 | Armageddon | 1998 | English | 2 | Bruce Willis | 19/3/1955 | Harry
4 | Fantasia | 1940 | English | NULL | NULL | NULL | NULL
NULL | NULL | NULL | NULL | 3 | Julia Roberts | 28/10/1967 | NULL

The full disjunction does not include subsumed tuples
m-id | title | year | language | a-id | name | date-of-birth | role
1 | Zelig | 1983 | English | NULL | NULL | NULL | NULL
This tuple (the Movies tuple for Zelig, padded with nulls) will not be in the full disjunction
94
The Full Disjunction of the Given Relations
m-id | title | year | language | a-id | name | date-of-birth | role
1 | Zelig | 1983 | English | 1 | Woody Allen | 1/12/1935 | Zelig
2 | Antz | 1998 | English | 1 | Woody Allen | 1/12/1935 | Z
3 | Armageddon | 1998 | English | 2 | Bruce Willis | 19/3/1955 | Harry
4 | Fantasia | 1940 | English | NULL | NULL | NULL | NULL
NULL | NULL | NULL | NULL | 3 | Julia Roberts | 28/10/1967 | NULL

The full disjunction does not include tuples that are based on Cartesian product rather than join, such as:
m-id | title | year | language | a-id | name | date-of-birth | role
4 | Fantasia | 1940 | English | 3 | Julia Roberts | 28/10/1967 | NULL
95
In the Full Disjunction of a Given Set of Relations:
Every tuple of the input is a part of at least one tuple of the output
Tuples are joined as in a natural join, padded with null values
The result includes only “maximal connected portions”
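The three properties above can be illustrated with a brute-force sketch over the example relations from the earlier slides. It is not the algorithm from the talk: it simply enumerates every way of picking at most one tuple per relation, keeps the picks that are connected and pairwise join consistent, and then discards subsumed tuples. The function names and the convention that a missing attribute stands for NULL are assumptions made for this illustration.

```python
from itertools import product

# Input relations from the slides; a missing attribute stands for NULL.
MOVIES = [
    {"m-id": 1, "title": "Zelig", "year": 1983, "language": "English"},
    {"m-id": 2, "title": "Antz", "year": 1998, "language": "English"},
    {"m-id": 3, "title": "Armageddon", "year": 1998, "language": "English"},
    {"m-id": 4, "title": "Fantasia", "year": 1940, "language": "English"},
]
ACTORS = [
    {"a-id": 1, "name": "Woody Allen", "date-of-birth": "1/12/1935"},
    {"a-id": 2, "name": "Bruce Willis", "date-of-birth": "19/3/1955"},
    {"a-id": 3, "name": "Julia Roberts", "date-of-birth": "28/10/1967"},
]
ACTED_IN = [
    {"a-id": 1, "m-id": 1, "role": "Zelig"},
    {"a-id": 1, "m-id": 2, "role": "Z"},
    {"a-id": 2, "m-id": 3, "role": "Harry"},
]
ACTORS_THAT_DIRECTED = [{"a-id": 1, "m-id": 1}]


def join_consistent(t, s):
    """True if t and s agree on every attribute they share."""
    return all(t[a] == s[a] for a in t.keys() & s.keys())


def connected(tuples):
    """True if the tuples form one component, linking tuples that share an attribute."""
    if not tuples:
        return False
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for j, other in enumerate(tuples):
            if j not in seen and tuples[i].keys() & other.keys():
                seen.add(j)
                stack.append(j)
    return len(seen) == len(tuples)


def full_disjunction(relations):
    """Brute force (exponential in the number of relations), for illustration only."""
    candidates = []
    # Pick at most one tuple (or None) from each relation.
    for choice in product(*[[None] + rel for rel in relations]):
        chosen = [t for t in choice if t is not None]
        pairwise_ok = all(
            join_consistent(t, s)
            for i, t in enumerate(chosen)
            for s in chosen[i + 1:]
        )
        if connected(chosen) and pairwise_ok:
            merged = {}
            for t in chosen:
                merged.update(t)
            if merged not in candidates:
                candidates.append(merged)
    # Keep only maximal tuples: drop any tuple subsumed by a different one.
    return [
        c for c in candidates
        if not any(c != d and c.items() <= d.items() for d in candidates)
    ]


result = full_disjunction([MOVIES, ACTORS, ACTED_IN, ACTORS_THAT_DIRECTED])
```

On this input the sketch yields the five tuples of the table shown earlier: Fantasia and Julia Roberts survive as dangling tuples, the padded Zelig tuple is dropped as subsumed, and Fantasia is never paired with Julia Roberts because those two tuples share no attribute.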
96
Motivation for Full Disjunctions
• Full disjunctions were proposed by Galindo-Legaria as an alternative to outerjoins [SIGMOD’94]
• Rajaraman and Ullman suggested using full disjunctions for information integration [PODS’96]
97
Computing Full Disjunctions for γ-acyclic Relation Schemas
• Rajaraman and Ullman have shown how to evaluate the full disjunction by a sequence of natural outerjoins when the relation schemas are γ-acyclic
• Hence, the full disjunction can be computed in polynomial time, under input-output complexity, when the relation schemas are γ-acyclic
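A minimal sketch of the outerjoin approach, assuming a simple chain of three relations (Movies, Acted-in, Actors) taken from the earlier example. The helper `natural_full_outerjoin` and the use of `None` for NULL are assumptions of this illustration, and the join order is chosen by hand. The sketch does not check γ-acyclicity and makes no claim about schemas outside that class.

```python
def attributes(rows):
    """Attribute names of a relation, in first-seen order."""
    return list(dict.fromkeys(a for row in rows for a in row))


def natural_full_outerjoin(left, right):
    """Natural full outerjoin; None plays the role of NULL and never matches."""
    left_attrs, right_attrs = attributes(left), attributes(right)
    common = [a for a in left_attrs if a in right_attrs]
    all_attrs = left_attrs + [a for a in right_attrs if a not in left_attrs]
    out, matched_right = [], set()
    for lt in left:
        matched = False
        for j, rt in enumerate(right):
            if all(lt[a] is not None and lt[a] == rt[a] for a in common):
                matched = True
                matched_right.add(j)
                out.append({a: lt[a] if a in lt else rt[a] for a in all_attrs})
        if not matched:
            # Dangling left tuple: pad the right-hand attributes with None.
            out.append({a: lt.get(a) for a in all_attrs})
    for j, rt in enumerate(right):
        if j not in matched_right:
            # Dangling right tuple: pad the left-hand attributes with None.
            out.append({a: rt.get(a) for a in all_attrs})
    return out


movies = [
    {"m-id": 1, "title": "Zelig", "year": 1983, "language": "English"},
    {"m-id": 2, "title": "Antz", "year": 1998, "language": "English"},
    {"m-id": 3, "title": "Armageddon", "year": 1998, "language": "English"},
    {"m-id": 4, "title": "Fantasia", "year": 1940, "language": "English"},
]
acted_in = [
    {"a-id": 1, "m-id": 1, "role": "Zelig"},
    {"a-id": 1, "m-id": 2, "role": "Z"},
    {"a-id": 2, "m-id": 3, "role": "Harry"},
]
actors = [
    {"a-id": 1, "name": "Woody Allen", "date-of-birth": "1/12/1935"},
    {"a-id": 2, "name": "Bruce Willis", "date-of-birth": "19/3/1955"},
    {"a-id": 3, "name": "Julia Roberts", "date-of-birth": "28/10/1967"},
]

# Movies joined with Acted-in first, then with Actors.
result = natural_full_outerjoin(natural_full_outerjoin(movies, acted_in), actors)
```

Each outerjoin is polynomial in the size of its inputs and output, which is the intuition behind the input-output complexity bound stated above.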
98
Weak Semantics Generalizes Full Disjunctions
• Relations can be converted into a semistructured database
• The full disjunction can be expressed as the union of several queries that are evaluated under weak semantics
We have developed an algorithm that uses this generalization to compute full disjunctions in polynomial time under input-output complexity, even when the relation schemas are cyclic
99
Generalizing Full Disjunctions
• In a full disjunction, tuples are joined according to equality constraints as in a natural join (or equi-join)
• We can generalize full disjunctions to support constraints that are not merely equality among attributes
100
Example
Movies (m-id, title, year, language, location)
Actors (a-id, name, date-of-birth)
Acted-in (a-id, m-id, role)
Actors-that-Directed (a-id, m-id)
Historical-Events (name, date, description)
Historical-Sites (Country, State, City, Site)
The date of the historical event is a date in the year when the movie was released
The filming location is near the historical site
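The first constraint above can be sketched for two relations as a full outerjoin driven by an arbitrary predicate instead of attribute equality. The movie rows reuse titles and years from the earlier slides; the two events, the `same_year` predicate and the function `disjunction_under` are hypothetical names and data introduced only for this illustration.

```python
from datetime import date

# Movie titles and years come from the slides; the events are hypothetical.
movies = [
    {"m-id": 2, "title": "Antz", "year": 1998},
    {"m-id": 4, "title": "Fantasia", "year": 1940},
]
events = [
    {"name": "Event A", "date": date(1998, 5, 1)},
    {"name": "Event B", "date": date(1975, 1, 1)},
]


def same_year(movie, event):
    """Constraint: the event's date falls in the year the movie was released."""
    return event["date"].year == movie["year"]


def disjunction_under(constraint, left, right):
    """Pair tuples that satisfy the constraint; keep dangling tuples on both sides.

    Assumes the two relations have no attribute names in common, so merging
    the dictionaries cannot overwrite a value.
    """
    out, matched_right = [], set()
    for lt in left:
        partners = [j for j, rt in enumerate(right) if constraint(lt, rt)]
        matched_right.update(partners)
        if partners:
            out.extend({**lt, **right[j]} for j in partners)
        else:
            out.append(dict(lt))  # dangling movie, event attributes left as NULL
    out.extend(dict(rt) for j, rt in enumerate(right) if j not in matched_right)
    return out


result = disjunction_under(same_year, movies, events)
```

A "filming location is near the historical site" constraint would plug in the same way, as another predicate over a pair of tuples.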
101
Another Way of Generalizing Full Disjunctions: Use OR-Semantics
• OR-semantics is used rather than weak semantics when tuples are joined
• This relaxes the requirement that every pair of tuples should be join consistent
• Instead, a tuple of the full disjunction is only required to be generated by database tuples that form a connected subgraph, but need not be pairwise join consistent
102
Example
Employees (e-id, ename, city, dept-no)
Departments (dept-no, dname, building)
Located-in (building, city, street)
Employee: (007, James Bond, London, 6)
Department: (6, MI-6, 10)
Located-in: (10, Liverpool, King)
The Full Disjunction
e-id | ename | city | dept-no | dept-no | dname | building | building | city | street
007 | James Bond | London | 6 | 6 | MI-6 | 10 | NULL | NULL | NULL
NULL | NULL | NULL | NULL | 6 | MI-6 | 10 | 10 | Liverpool | King
103
Example
Employees (e-id, ename, city, dept-no)
Departments (dept-no, dname, building)
Located-in (building, city, street)
Employee: (007, James Bond, London, 6)
Department: (6, MI-6, 10)
Located-in: (10, Liverpool, King)
The Full Disjunction under OR-Semantics
e-id | ename | city | dept-no | dept-no | dname | building | building | city | street
007 | James Bond | London | 6 | 6 | MI-6 | 10 | 10 | Liverpool | King
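The difference between the two results can be sketched on the three tuples of this example. The reading of OR-semantics used here (two tuples are linked when they share attributes and agree on all of them, and a tuple set qualifies when these links connect it) is an assumption of the illustration, as are the function names.

```python
from itertools import combinations

# The three database tuples from the example, keyed by relation name.
DB = {
    "Employees": {"e-id": "007", "ename": "James Bond", "city": "London", "dept-no": 6},
    "Departments": {"dept-no": 6, "dname": "MI-6", "building": 10},
    "Located-in": {"building": 10, "city": "Liverpool", "street": "King"},
}


def shares(t, s):
    """True if the two tuples have at least one attribute in common."""
    return bool(t.keys() & s.keys())


def joinable(t, s):
    """True if the tuples share attributes and agree on all of them."""
    return shares(t, s) and all(t[a] == s[a] for a in t.keys() & s.keys())


def connected(names, edge):
    """True if the named tuples form one component under the given edge test."""
    names = list(names)
    seen, stack = {names[0]}, [names[0]]
    while stack:
        n = stack.pop()
        for m in names:
            if m not in seen and edge(DB[n], DB[m]):
                seen.add(m)
                stack.append(m)
    return len(seen) == len(names)


def weak_ok(names):
    """Weak semantics: connected, and every pair that shares attributes agrees."""
    pairs_ok = all(
        joinable(DB[a], DB[b]) or not shares(DB[a], DB[b])
        for a, b in combinations(names, 2)
    )
    return pairs_ok and connected(names, shares)


def or_ok(names):
    """OR-semantics (as assumed here): connected through joinable pairs only."""
    return connected(names, joinable)


def maximal_sets(ok):
    """All qualifying sets of relation names that are not contained in a larger one."""
    sets = [
        frozenset(c)
        for k in range(1, len(DB) + 1)
        for c in combinations(DB, k)
        if ok(c)
    ]
    return [s for s in sets if not any(s < t for t in sets)]


weak_result = maximal_sets(weak_ok)
or_result = maximal_sets(or_ok)
```

Under weak semantics the Employees and Located-in tuples disagree on city, so the maximal sets are {Employees, Departments} and {Departments, Located-in}, matching the two-row table. Under the assumed OR-semantics all three tuples form one connected set, matching the single row above.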
104
Information Integration from Heterogeneous Sources
[Diagram: three data sources; a query over each source produces a relation, and the three relations are combined into one integrated relation]
105
[Diagram: three data sources; a query over each source produces a relation, and the three relations are combined into one integrated relation]
We use queries that combine flexible semantics and weak semantics:
– The queries are insensitive to changes in the data
– The queries are easy to formulate
106
[Diagram: three data sources; a query over each source produces a relation, and the three relations are combined into one integrated relation]
The integration of the relations is done with a full disjunction of the computed relations
107
Conclusion
• Flexible and semiflexible queries facilitate easy and intuitive querying of semistructured databases
– Querying the database even when the user is oblivious to the structure of the database
– Queries are insensitive to variations in the structure of the database
108
Conclusion (continued)
• Queries in AND semantics, OR semantics or weak semantics facilitate easy and intuitive querying of incomplete databases
– Querying the database even when the user is oblivious to missing data
– Queries return maximal answers rather than complete answers
109
Conclusion (continued)
• The two paradigms of flexible queries and queries with maximal answers can be combined
• The combination of the paradigms can facilitate integration of information from heterogeneous sources
110
Conclusion (continued)
• Full disjunctions can be computed using queries in weak semantics
• Full disjunctions can be generalized so that relations are joined using constraints that are not merely equality constraints
111
Thank You
Questions?