New Bases for New Data - Omar Benjelloun, Stanford University, January 27th, 2006

Upload: jason-godfrey-jennings

Post on 18-Jan-2018


Page 1

New Bases for New Data

Omar Benjelloun
Stanford University
January 27th, 2006

Page 2

Relational databases are great

A simple, understandable model for data

High-level, declarative language for queries and updates: SQL

Efficient optimization techniques

Relational databases are the cornerstone of the management of homogeneous, regular, exact, centralized information

Boss
Emp     Manager
Joe     Bill
Bill    Steve

Page 3

… but data has changed

• Data is distributed, behind applications, dynamically changing
• Data is heterogeneous
• Data may be uncertain

Today
• Data is stored in relational databases (or XML)
• Techniques for data integration, data exchange
• … Lots of code

Traditional Database Management Systems (DBMS’s) are too rigid

New characteristics should be represented in the data

New bases are needed
• Foundations (models and languages)
• Processing and optimization techniques

Page 4

Applications

Information integration
• Data is distributed on multiple heterogeneous, independent sources
• Conflicting information from the sources: inconsistency, uncertainty
• Varying and evolving reliability of sources
• Where data came from can be critical information

Scientific data management

Receptor (e.g., sensor) data management

Data cleaning (entity resolution)

And many others…

Page 5

Agenda

Distributed and dynamic data: Active XML
• A “glue” language to connect data and programs
• XML documents with embedded calls to Web services
• Distributed interactions through the exchange of AXML data
• Techniques to query and control the exchange of AXML data

Uncertain data: ULDB’s
• An extension of the relational model with uncertainty and lineage
• Efficient query evaluation
• Computing probabilities

Conclusion

Page 6

Active XML

Page 7

Distributed data management

Information is everywhere

[Figure: peers connected through the Internet, each holding XML data and exposing (Web) services. Examples of peers: data warehouses, databases, Web sites, PCs, PDAs, cell phones, home appliances, cars…]

Page 8

The golden triangle of distributed data management

XML: a standard for data representation & exchange
• Extensible Markup Language
• Labeled ordered trees
• Rich types: XML Schema

Query languages
• XPath, XQuery

Web services
• Standards for distributed computing

[Figure: a triangle whose corners are XML, XQuery/XPath, and SOAP/WSDL.]

Page 9

What is Active XML (AXML)?

AXML is a declarative language for distributed information management and an infrastructure to support this language, in a peer-to-peer framework.

Page 10

Active XML documents

XML documents with embedded calls to Web services

Intensional
• Some of the data is given explicitly
• Some is given intensionally (i.e. the means to acquire data when needed are given)

Dynamic
• If the external sources change, the same document will provide different information
• Reaction to world changes

Page 11

Not a new idea in databases, nor on the Web

Mixing calls and data is an old idea
• Procedural attributes in relational systems
• Basis of Object-oriented Databases

In Web programming
• Sun’s JSP, PHP+MySQL

Calls to Web services inside documents
• Macromedia FLEX, Apache Jelly, Microsoft XAML

What is new is the exploitation of the idea…

Page 12

Web services in brief

A number of standards
• XML
• SOAP: Exchange of messages between applications
• WSDL: Description of service interfaces (e.g. input/output types)
• UDDI: Advertisement and discovery of services
• … other proposed standards (choreography, security, etc.)

For us: means to provide, invoke and describe remote functions with XML input/output.

They make AXML documents universally understandable.

Page 13

A sample AXML document

<?xml version="1.0" ?>
<newspaper>
  <title>Le Monde</title>
  <date>06/10/2003</date>
  <call svc="Yahoo.GetTemp">
    <city>Paris</city>
  </call>
  <call svc="TimeOut.GetEvents">
    exhibits
  </call>
</newspaper>

[Figure: the same document drawn as a tree. The root newspaper has children title (“Le Monde”), date (“06/10/2003”), a GetTemp call with parameter city (“Paris”), and a GetEvents call with parameter “Exhibits”.]

AXML documents may contain calls:
• to any existing Web services (e-bay.net, google.com…)
• to any AXML Web services (to be defined)

Page 14

Materialization

• Replacing the call by its result is not the only option
• Calls are not necessarily RPC-style synchronous invocations

[Figure: the sample newspaper document and its tree. The GetTemp call is sent to Yahoo! (Y!) as a SOAP call; the answer <temp>16°C</temp> appears in the tree as a temp node (“16°C”).]
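As a minimal sketch of the simplest form of materialization (replacing a call by its result), the following uses Python's standard xml.etree.ElementTree. The function fake_get_temp is a hypothetical stand-in for the remote SOAP invocation, and the services dictionary is an assumption of this example, not part of the AXML system.

```python
import xml.etree.ElementTree as ET

# The sample document from the slides, with straight quotes.
DOC = """<newspaper>
  <title>Le Monde</title>
  <date>06/10/2003</date>
  <call svc="Yahoo.GetTemp"><city>Paris</city></call>
  <call svc="TimeOut.GetEvents">exhibits</call>
</newspaper>"""


def fake_get_temp(call):
    # Hypothetical stand-in for the remote SOAP invocation.
    temp = ET.Element("temp")
    temp.text = "16°C"
    return [temp]


def materialize(root, services):
    """Replace each <call> whose service is known by the elements it returns."""
    for parent in list(root.iter()):
        for child in list(parent):
            svc = child.get("svc")
            if child.tag == "call" and svc in services:
                pos = list(parent).index(child)
                parent.remove(child)
                for offset, result in enumerate(services[svc](child)):
                    parent.insert(pos + offset, result)
    return root


root = materialize(ET.fromstring(DOC), {"Yahoo.GetTemp": fake_get_temp})
print(ET.tostring(root, encoding="unicode"))
```

The GetEvents call is left in place because no service is registered for it, which mirrors the point that materialization is a choice and not an obligation.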

Page 15

AXML Web services

Parameters: AXML data

Result: AXML data

Distribute computations: by sending as parameters data containing service calls, one can delegate some work to other peers.

Partial computations: by returning data containing service calls, one can give to the receiver the control of these calls.

Great flexibility

Page 16

Distributed interactions

Page 17

Exchanging Active XML

Page 18

To call or not to call?

Materialization can be performed by the sender, before sending a document… or by the receiver, after receiving it.

[Figure: the newspaper tree (title “Le Monde”, date “06/10/2003”, GetEvents with “Exhibits”, GetTemp with city “Paris”) appears twice, together with the GetTemp call to Yahoo! (Y!) and the resulting temp node (“16°C”).]

Page 19

Why control the materialization of calls?

For added functionality, e.g.
• Intensional data makes it possible to get up-to-date information.

For security reasons or capabilities, e.g.
• I don’t trust this Web service/domain,
• I don’t have the right credentials to invoke it,
• It costs money,
• Maybe the receiver doesn’t know Active XML!

For performance reasons, e.g.
• A proxy can invoke all the services on behalf of a PDA.

… and many more reasons you can think of!

Page 20

How to control it? Using types

We extend XML Schema with intensional types: XML Schema_int

Static analysis algorithms use signatures of services: WSDL_int

[Figure: a sender and a receiver, each with its own capabilities, ACL, cost, …, and a data exchange schema between them. Trees with nodes labeled r, f, g and q illustrate the exchanged documents.]

Page 21

The extended schema language

To simplify, we use here a DTD-like syntax

Data:
newspaper = title.date.(GetTemp|temp).(GetEvents|exhibit*)
title = data
date = data
temp = data
city = data
exhibit = title.(GetDate|date)

Functions:
GetTemp(city) -> temp
GetEvents(data) -> (exhibit|performance)*
GetDate(title) -> date

Rewriting: replace call(s) by an arbitrary output of the service.

[Figure: the newspaper tree with title (“Le Monde”), date (“06/10/2003”), GetTemp (city “Paris”) and GetEvents (“Exhibits”).]
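To make the schema above concrete, here is a small sketch that checks the sequence of child labels of a newspaper element against the rule newspaper = title.date.(GetTemp|temp).(GetEvents|exhibit*), using an ordinary regular expression. Encoding labels as space-terminated tokens and the helper names conforms and rewrite are assumptions of this example.

```python
import re

# newspaper = title.date.(GetTemp|temp).(GetEvents|exhibit*)
# Each label is followed by one space so that labels cannot run together.
NEWSPAPER = re.compile(r"title date (GetTemp|temp) (GetEvents |(exhibit )*)")


def conforms(labels):
    """True if the sequence of child labels matches the newspaper rule."""
    word = "".join(label + " " for label in labels)
    return NEWSPAPER.fullmatch(word) is not None


def rewrite(labels, call, output):
    """Replace the first occurrence of `call` by one possible output of the service."""
    i = labels.index(call)
    return labels[:i] + list(output) + labels[i + 1:]


doc = ["title", "date", "GetTemp", "GetEvents"]
after_temp = rewrite(doc, "GetTemp", ["temp"])
good_events = rewrite(after_temp, "GetEvents", ["exhibit", "exhibit"])
bad_events = rewrite(after_temp, "GetEvents", ["exhibit", "performance"])
print(conforms(doc), conforms(after_temp), conforms(good_events), conforms(bad_events))
```

The last case shows why rewriting needs care: GetEvents may return a performance, which the newspaper type does not allow.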

Page 22

Rewritings

The goal: given
• an AXML document d
• a schema s
can we rewrite d so that it matches s?

Safe rewriting: one that for sure leads to s (we know without making any call)

Possible rewriting: one that may lead to s (depending on the answers of services)

Page 23

Difficulties

Infinite search space
• Vertical
• Horizontal

Main problem
• The result of a Web service call is unknown
• We just know a signature (input/output types)

We want a very efficient solution

Foundations of the problem
• String & tree automata,
• with existential and universal transitions.

Page 24

Results

The general problem is undecidable [MSS03]

Restrictions on the considered rewritings
• Left-to-right: no “going back and forth”
• K-depth: bound on the nesting of function calls (search space still infinite but finitely representable)

Under these restrictions
• We have algorithms to find safe/possible rewritings.
• They are PTIME (for deterministic schemas).
• We can also do it between schemas.

Implementation
• Demo at VLDB 2003 (customizable news syndication)

Page 25

Safe rewriting algorithm (flavor)

Build an FSA $A_w^k$ that accepts all k-depth rewritings of the initial word.

[Figure: the automaton $A_w^1$, with states q0 to q7 and transitions labeled title, date, GetTemp, temp, GetEvents, exhibit and performance.]

Build an FSA $\overline{A}$ that recognizes the complement of the target type.

[Figure: the automaton $\overline{A}$, with states p0 to p6 and transitions labeled title, date, temp, GetEvents, exhibit and * (any other label).]

Page 26

Safe rewriting algorithm

Compute the intersection of these languages: $A_{\times} = A_w^k \times \overline{A}$

[Figure: the product automaton, with states such as (q0,p0), (q1,p1), (q2,p2), (q3,p3), (q4,p4), (q5,p2), (q6,p3), (q3,p6), (q7,p6), (q4,p6), (q7,p3), (q4,p3), (q7,p5), (q4,p5) and transitions labeled title, date, temp, GetTemp, GetEvents, exhibit and performance.]

A smart marking determines whether a safe rewriting exists.
Then run the word on the marked automaton to find an actual rewriting.

Optimizations:
• lazy construction of the automata
• parallel evaluation of calls
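The sketch below illustrates the idea for 1-depth, left-to-right rewritings only. Instead of building the product automaton explicitly, it walks the word and the target DFA together: at each call we may keep it or invoke it, and an invocation is safe only if every output allowed by the signature still leads to acceptance. The dictionary encoding of automata and all function names are assumptions of this example; it is not the marking algorithm of the slides.

```python
from functools import lru_cache


def step(dfa, state, symbol):
    """One DFA move; None plays the role of the dead state."""
    return None if state is None else dfa["delta"].get((state, symbol))


def outcomes(target, state, out_type):
    """Target states reachable from `state` by reading any word of `out_type`."""
    start = (out_type["start"], state)
    seen, stack, found = {start}, [start], set()
    while stack:
        o, p = stack.pop()
        if o in out_type["accept"]:
            found.add(p)
        for (src, symbol), dst in out_type["delta"].items():
            nxt = (dst, step(target, p, symbol))
            if src == o and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return found


def rewritable(word, target, signatures, safe=True):
    """safe=True: all service answers must work; safe=False: some answer may."""
    quantify = all if safe else any

    @lru_cache(maxsize=None)
    def ok(i, p):
        if p is None:
            return False
        if i == len(word):
            return p in target["accept"]
        symbol = word[i]
        if ok(i + 1, step(target, p, symbol)):  # keep the symbol (or the call)
            return True
        if symbol in signatures:  # invoke the call
            results = outcomes(target, p, signatures[symbol])
            return quantify(ok(i + 1, q) for q in results)
        return False

    return ok(0, target["start"])


# Signatures: GetTemp -> temp, GetEvents -> (exhibit|performance)*
SIGNATURES = {
    "GetTemp": {"start": 0, "accept": {1}, "delta": {(0, "temp"): 1}},
    "GetEvents": {"start": 0, "accept": {0},
                  "delta": {(0, "exhibit"): 0, (0, "performance"): 0}},
}
# Target 1: title.date.temp.(GetEvents|exhibit*)
WITH_CALL = {"start": 0, "accept": {3, 4, 5},
             "delta": {(0, "title"): 1, (1, "date"): 2, (2, "temp"): 3,
                       (3, "GetEvents"): 4, (3, "exhibit"): 5, (5, "exhibit"): 5}}
# Target 2: title.date.temp.exhibit*
NO_CALL = {"start": 0, "accept": {3, 5},
           "delta": {(0, "title"): 1, (1, "date"): 2, (2, "temp"): 3,
                     (3, "exhibit"): 5, (5, "exhibit"): 5}}

WORD = ("title", "date", "GetTemp", "GetEvents")
print(rewritable(WORD, WITH_CALL, SIGNATURES))            # safe rewriting exists
print(rewritable(WORD, NO_CALL, SIGNATURES))              # no safe rewriting
print(rewritable(WORD, NO_CALL, SIGNATURES, safe=False))  # but a possible one
```

With the second target, GetEvents must be invoked and may return a performance, so the rewriting is only possible and not safe.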

Page 27

Querying Active XML

Page 28

Querying AXML Data

Given a (tree pattern) query:
/newspaper[temp > 18°C]/exhibits//exhibit[location="Le Louvre"]

Materialize the document?

Call only the services that may contribute data to the query answer.

The problem: lazy evaluation of service calls
To call or not to call, this time when evaluating a query

[Figure: a newspaper tree with title (“Le Monde”), calls GetDate, GetTemp (city “Paris”) and GetEvents (“Exhibits”), an exhibits node containing a GetExhibits call (city “Paris”), and a temp node (“19°C”).]

Page 29

Lazy evaluation

Difficulties:
• Calls can be found everywhere in the document
• May appear dynamically (as a result of previous calls)
• May become (ir)relevant due to previous invocations
• Need to take signatures of calls into consideration

A possible approach: modify the query processor
• Top-down evaluation
• Trigger the calls found on the way
• Not so great:
  – Computation is blocked
  – Optimization opportunities are lost

Page 30

NFQ’s

Given a query to evaluate: derive a set of “node-focused” queries (NFQ’s) that find the relevant calls when evaluated on the document.

They need to be reevaluated as the document evolves!

[Figure: the tree pattern query (newspaper, temp > 18°C, exhibits, exhibit, location “Le Louvre”) next to NFQ’s derived from it, in which * nodes stand for the calls being looked for, etc.]
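A very reduced sketch of the idea of finding relevant calls: for a linear, child-axis-only path query, a call is treated as relevant when its ancestors match a prefix of the path, since its result could then supply the next step. Predicates, the // axis and service signatures are ignored, and the document (including the Archive.GetIssues service) is a hypothetical example, so this is far simpler than real NFQ’s.

```python
import xml.etree.ElementTree as ET

DOC = """<newspaper>
  <title>Le Monde</title>
  <archive><call svc="Archive.GetIssues"/></archive>
  <call svc="Yahoo.GetTemp"><city>Paris</city></call>
  <exhibits><call svc="TimeOut.GetExhibits"><city>Paris</city></call></exhibits>
</newspaper>"""


def relevant_calls(root, path):
    """Return the services of calls located where a step of `path` may still be produced."""
    found = []

    def visit(node, depth):
        # `node` matches path[depth - 1]; its children may match path[depth].
        if depth >= len(path):
            return
        for child in node:
            if child.tag == "call":
                found.append(child.get("svc"))
            elif child.tag == path[depth]:
                visit(child, depth + 1)

    if root.tag == path[0]:
        visit(root, 1)
    return found


calls = relevant_calls(ET.fromstring(DOC), ["newspaper", "exhibits", "exhibit"])
print(calls)
```

After invoking one of the returned calls, the function would have to be run again, because the result may contain new calls or make others irrelevant.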

Page 31

Optimizations

Service calls sequencing
• Analysis of the relationship between calls (through the NFQ’s)
• Layering, and parallelization inside each layer

Filtering by type analysis
• Match output types of services to the data expected by queries

“Pushing” queries to capable services

Acceleration:
• Via relaxation:
  – NFQ approximation
  – Superset of the relevant calls
• Via a special access structure, similar to a DataGuide:
  – Restricted to paths that lead to service calls
  – Indexes the calls

Experimental assessment
• 10x speed-up when combining optimizations

Page 32

There is more…

The AXML peer system
• Manages persistent AXML documents
• Provides AXML services
• Open source

Language extensions to control the activation of calls

Continuous services

Theoretical foundations

…check out http://www.activexml.net

Page 33

Uncertain data

Page 34

Basic Premise

Traditional relational DB
• Every data item’s value must be exact
• Every data item is in the database or not
• Where data came from and how it evolves is not important

ULDB’s relax these constraints by making
1. Data
2. Uncertainty
3. Lineage
all first-class interrelated concepts

Page 35

Previous work

Models for uncertainty
• Labeled nulls, c-tables, probabilistic models,...

Trade-off between
• Expressiveness
• Simplicity of representation, complexity of operations
• We investigated this space in [DBHM06]

Models for lineage
• In relational databases, data warehouses
• Definition of lineage can be tricky for complex queries

First to consider lineage together with uncertainty

Page 36

Uncertainty

[Figure: the relation SAW(Witness, Car), built up from x-tuples. An x-tuple groups alternates, e.g. (Granny, VW) and (Granny, BMW); a “?” marks a maybe x-tuple. Other rows shown: (Cop, Ford) and (Cop, VW).]

Possible worlds: [Figure: sets of tuples such as {(Granny, VW), (Cop, Ford)}, {(Granny, BMW), (Cop, Ford)}, {(Granny, VW), (Cop, VW)} and {(Granny, BMW), (Cop, VW)}.]

Simple formalism
• not complete
• not closed under joins
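As an illustration of the possible-worlds reading of x-tuples, the sketch below enumerates the worlds of a small SAW relation: each x-tuple contributes exactly one of its alternates, or nothing if it is a maybe (“?”) x-tuple. Which x-tuple carries the “?” is an assumption of this example, since the figure does not survive in the transcript.

```python
from itertools import product

ABSENT = None

# Each x-tuple is (alternates, maybe). Marking the Cop x-tuple as "maybe"
# is an assumption made for illustration.
SAW = [
    ([("Granny", "VW"), ("Granny", "BMW")], False),
    ([("Cop", "Ford")], True),
]


def possible_worlds(xtuples):
    """Pick one alternate per x-tuple, or no tuple at all for a maybe x-tuple."""
    choices = [list(alts) + ([ABSENT] if maybe else []) for alts, maybe in xtuples]
    return [frozenset(t for t in combo if t is not ABSENT)
            for combo in product(*choices)]


worlds = possible_worlds(SAW)
for world in worlds:
    print(sorted(world))
```

Two alternates for Granny times two options for Cop (present or absent) give four worlds.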

Page 37

Lineage

SAW
Witness   Car
Granny    VW
Cop       Ford

OWNS
Suspect   Car
Chris     VW
Chris     BMW
Mike      VW
Mike      Ford

ACCUSES (projection on witness, suspect)
Witness   Suspect
Granny    Chris
Granny    Mike
Cop       Mike
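A minimal sketch of how lineage can be recorded while computing ACCUSES from SAW and OWNS, assuming ACCUSES is the join of the two relations on Car projected on witness and suspect. The tuple identifiers (s1, o1, …) are hypothetical names introduced for this example.

```python
# Tuples carry hypothetical identifiers so that lineage can point at them.
SAW = {"s1": ("Granny", "VW"), "s2": ("Cop", "Ford")}
OWNS = {"o1": ("Chris", "VW"), "o2": ("Chris", "BMW"),
        "o3": ("Mike", "VW"), "o4": ("Mike", "Ford")}


def accuses(saw, owns):
    """Join SAW and OWNS on Car, project on (witness, suspect), keep lineage."""
    result = {}
    for sid, (witness, car) in saw.items():
        for oid, (suspect, owned) in owns.items():
            if car == owned:
                result.setdefault((witness, suspect), set()).add((sid, oid))
    return result


ACCUSES = accuses(SAW, OWNS)
for pair, lineage in ACCUSES.items():
    print(pair, sorted(lineage))
```

Each result tuple remembers which SAW and OWNS tuples produced it, which is the information the slide calls lineage.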

Page 38

ULDB’s

[Figure: the SAW (Witness, Car), OWNS (Suspect, Car) and ACCUSES (Witness, Suspect) relations of the lineage example, now with uncertainty: the alternate (Granny, BMW) appears in SAW, and “?” marks appear, including next to (Granny, Chris).]

Page 39

ULDB’s

[Figure: the same SAW, OWNS and ACCUSES relations, with (Granny, BMW), (Granny, Chris) and “?” marks.]

Page 40

Properties

ULDB’s are simple
• x-tuples: set of alternate tuples, with or without ‘?’
• lineage: associates with each alternate a set of alternates / external symbols

ULDB’s are expressive
• Complete: can represent any finite set of possible worlds (with lineage)
• Simple implementation of monotonic queries, with correct lineages
• Natural probabilistic extension

ULDB’s are efficient
• Query processing can use existing query optimizers
• Tuple certainty/membership can be tested in polynomial time

Page 41

Query processing

Page 42

Querying ULDB’s

[Figure: a diagram with two levels. Top level (ULDB’s): the algorithm takes D to Q(D). Bottom level (relational databases with lineage): the possible worlds D1, D2, …, Dn of D are taken to Q(D1), Q(D2), …, Q(Dn) by the query semantics.]

Query semantics: Q(Di) adds the query result as a new relation, with its lineage, to Di.

Page 43

Algorithm

[Figure: the SAW, OWNS and ACCUSES (projection on witness, suspect) relations of the lineage example, with additional entries involving Granny/BMW and Kid/Ford, derived entries (Granny, Chris) and (Kid, Mike), and “?” marks.]
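The following is a rough sketch of a two-phase evaluation of the join over x-tuples: first an ordinary join over all alternates that records lineage, then a single grouping pass that assembles result x-tuples. The data, the choice of maybe x-tuples and the conservative rule for placing “?” (an input is maybe, or some combination of input alternates does not join) are assumptions of this example and may differ from the actual ULDB algorithm.

```python
from itertools import product

# An x-tuple is (alternates, maybe). Which x-tuples are "maybe" is assumed here.
SAW = {1: ([("Granny", "VW"), ("Granny", "BMW")], False),
       2: ([("Cop", "Ford")], True)}
OWNS = {1: ([("Chris", "VW")], False), 2: ([("Chris", "BMW")], False),
        3: ([("Mike", "VW")], False), 4: ([("Mike", "Ford")], False)}


def join_rows(saw, owns):
    """Phase 1: an ordinary join over alternates, each row carrying its lineage."""
    rows = []
    for (sx, (s_alts, _)), (ox, (o_alts, _)) in product(saw.items(), owns.items()):
        for (si, (witness, car)), (oi, (suspect, owned)) in product(
                enumerate(s_alts), enumerate(o_alts)):
            if car == owned:
                lineage = {("SAW", sx, si), ("OWNS", ox, oi)}
                rows.append(((sx, ox), (witness, suspect), lineage))
    return rows


def group(rows, saw, owns):
    """Phase 2: one pass that groups rows into result x-tuples."""
    grouped = {}
    for key, values, lineage in rows:
        grouped.setdefault(key, []).append((values, lineage))
    result = {}
    for (sx, ox), alts in grouped.items():
        combos = len(saw[sx][0]) * len(owns[ox][0])
        maybe = saw[sx][1] or owns[ox][1] or len(alts) < combos
        result[(sx, ox)] = (alts, maybe)
    return result


ACCUSES = group(join_rows(SAW, OWNS), SAW, OWNS)
for key, (alts, maybe) in sorted(ACCUSES.items()):
    print(key, alts, "?" if maybe else "")
```

Phase 1 is a plain relational join, which is why a standard query optimizer could be reused for it.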

Page 44

Properties

Efficient algorithm
• Query processing phase can use standard query optimizer
• Lineages are easy to propagate
• “Grouping” phase requires a single pass on the result

Initial prototype
• Represents a ULDB as a relational DB
• Uses simple query rewriting techniques

Algorithm works for any monotonic query (including SPJU queries)

Page 45

Probabilities

Page 46

Probabilistic ULDB’s

Semantics: as before, with a probability for each possible world

Without lineages
• Alternates of the same x-tuple correspond to disjoint events
• Alternates of different x-tuples correspond to independent events

Lineages
• Capture correlations
• Help propagate probabilities for query results

[Figure: SAW(Witness, Car) with rows (Granny, VW), (Granny, BMW) with “?”, (Cop, Ford) and (Cop, VW), annotated with the probabilities 0.2, 0.5, 0.3, 0.7 and 0.3.]
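Without lineage, the two rules above (disjoint alternates within an x-tuple, independent x-tuples) determine the probability of every possible world. The sketch below computes them with exact fractions. The assignment of probabilities to alternates (0.2 and 0.5 for Granny, 0.3 and 0.7 for Cop) is an assumption of this example, since the transcript does not say which number belongs to which row.

```python
from fractions import Fraction
from itertools import product

# One dict per x-tuple: alternate -> probability (assumed pairing of the numbers).
SAW = [
    {("Granny", "VW"): Fraction(2, 10), ("Granny", "BMW"): Fraction(5, 10)},
    {("Cop", "Ford"): Fraction(3, 10), ("Cop", "VW"): Fraction(7, 10)},
]


def world_probabilities(xtuples):
    """Map each possible world to its probability."""
    options = []
    for alts in xtuples:
        choices = list(alts.items())
        absent = 1 - sum(alts.values())  # probability that the x-tuple is absent
        if absent > 0:
            choices.append((None, absent))
        options.append(choices)
    result = {}
    for combo in product(*options):
        world = frozenset(t for t, _ in combo if t is not None)
        prob = Fraction(1)
        for _, p in combo:  # independent x-tuples: multiply
            prob *= p
        result[world] = result.get(world, 0) + prob
    return result


WORLDS = world_probabilities(SAW)
for world, prob in WORLDS.items():
    print(sorted(world), prob)
```

Here the Granny x-tuple is absent with probability 0.3, so it behaves like a “?” x-tuple, and the world probabilities sum to 1.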

Page 47

Probabilistic query answering

Compute queries as before

Compute probabilities on demand
• Traverse lineages transitively to the leaves
• Combine probabilities of reached alternates

Optimizations: memoize probabilities, efficiently detect ‘closest independent ancestors’

[Figure: alternates with “?” marks, annotated with the probabilities 0.2, 0.3, 0.4, 0.1, 0.3, 0.5 and 1.]
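A small sketch of on-demand probability computation for conjunctive lineage: follow the lineage of a derived alternate down to base alternates, then multiply the probabilities of the distinct base alternates reached. Two different alternates of the same base x-tuple are disjoint, so they yield probability 0. It always goes down to the leaves and does not implement the ‘closest independent ancestors’ optimization; all identifiers and numbers are hypothetical.

```python
from fractions import Fraction
from functools import lru_cache

# Base alternates: (x-tuple id, alternate index) -> probability (assumed values).
BASE = {
    ("saw1", 0): Fraction(1, 5),   # (Granny, VW)
    ("saw1", 1): Fraction(1, 2),   # (Granny, BMW)
    ("owns1", 0): Fraction(1),     # (Chris, VW)
    ("owns2", 0): Fraction(1),     # (Chris, BMW)
}
# Derived alternates: name -> the alternates in their (conjunctive) lineage.
LINEAGE = {
    "granny_chris_vw": (("saw1", 0), ("owns1", 0)),
    "granny_chris_bmw": (("saw1", 1), ("owns2", 0)),
    "vw_again": ("granny_chris_vw", ("saw1", 0)),
    "both_cars": ("granny_chris_vw", "granny_chris_bmw"),
}


@lru_cache(maxsize=None)
def leaves(alt):
    """Base alternates reached by following lineage transitively (memoized)."""
    if alt in BASE:
        return frozenset([alt])
    return frozenset().union(*(leaves(parent) for parent in LINEAGE[alt]))


def probability(alt):
    chosen = {}
    for xid, idx in leaves(alt):
        if chosen.setdefault(xid, idx) != idx:
            return Fraction(0)  # two alternates of one x-tuple are disjoint
    prob = Fraction(1)
    for xid, idx in chosen.items():
        prob *= BASE[(xid, idx)]
    return prob


for name in LINEAGE:
    print(name, probability(name))
```

Note that vw_again gets 1/5 and not 1/25: its two parents share the base alternate (Granny, VW), which is the kind of correlation lineage captures.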

Page 48

Future work

Richer queries
• Duplicate elimination, difference, aggregation
• Supported through new kinds of lineages (e.g., disjunctive, negative)
• Querying the uncertainty and the lineage

More operations
• Updates (and their lineage), close to versioning
• “Uncertain operations”, e.g., entity resolution, inconsistency repairs

More optimization techniques

More theory

Page 49

Conclusion

Page 50

New “Bases” for new data

The database way
• Simple models
• Declarative languages
• Optimization techniques

… for new features of data
• Distribution and decentralization: Active XML
• Uncertainty and lineage: ULDB’s

There are more challenges
• Real-world side effects, semantic reasoning

and strong requirements
• Security, privacy, personalization

Big challenge: doing it all in a coherent way
• One “big” model?
• Integration of models?

Page 51

Thank you