1 p2p data management, 2006, s. abiteboul1/89 information management in p2p serge abiteboul...

89
1 P2P Data Management, 2006, S. Abite boul 1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

Upload: jada-quinlan

Post on 27-Mar-2015

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

1

P2P Data Management, 2006, S. Abiteboul 1/89

Information Management in P2PSerge AbiteboulINRIA-Futurs and Univ. Paris 11

Page 2: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

2

Introduction

Page 3: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

3

P2P Data Management, 2006, S. Abiteboul 3/89

Success stories at the time of the Internet bubble

Google: management of Web pages

Mapquest: management of maps

Amazone: book catalogue

eBay: product catalogue

Napster (emule, bearshare, etc.): music database

Flickr: picture database

Wikipedia: dictionary

del.icio.us: annotations

In France: Meetic: dating databaseKelkoo: comparative shopping

They are all aboutpublishing some database

Page 4: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

4

P2P Data Management, 2006, S. Abiteboul 4/89

The trend is towards peer-to-peerand interactivity

P2P: A large and varying number of computers cooperate to solve some particular task without any centralized authority

Goal: build an efficient, robust, scalable system based (typically) on inexpensive, unreliable computers distributed in a wide area network

seti@home; kazaa; cabal

Switch from centralized servers to communities and syndication

Interaction and Web 2.0

Motivations: Social, organizational

Page 5: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

5

P2P Data Management, 2006, S. Abiteboul 5/89

Information management in a P2P network

Private terminology: data ring

Information is heterogeneous, distributed, replicated, dynamic

Which info: Data + meta-data + knowledge + services

Peers are heterogeneous, autonomous and possibly mobile• From sensors to PDA to mainframe

Typically very large number of peers

Variety of requirements: QoS, performance, security, etc.

Page 6: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

6

P2P Data Management, 2006, S. Abiteboul 6/89

Acknowledgement

Xyleme: Scalable XML warehousing

Sophie Cluet, Guy Ferran (Xyleme) & many others

ActiveXML: Language for P2P data management

Omar Benjelloun (Google), Ioana Manolescu, Tova Milo (Tel Aviv) & many others

KadoP: P2P scalable XML indexing

Ioana Manolescu, Nicoleta Preda & others

Data Ring: Infrastructure for P2P data management

Alkis Polyzotis (UC Santa Cruz)

Page 7: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

7

P2P Data Management, 2006, S. Abiteboul 7/89

Outline

1. Introduction – the data ring

2. Calculus for P2P data management (ActiveXML)

3. Algebra for P2P data management (ActiveXML algebra)

4. Indexing in P2P (KadoP)

5. Conclusion

Goal of the tutorial: present issues and technology on p2p information management

Warning: it is very biased – it is not a survey

Page 8: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

8

P2P Data Management, 2006, S. Abiteboul 8/89

Outline

1. Introduction – the data ring

2. Calculus for P2P data management (ActiveXML)

3. Algebra for P2P data management (ActiveXML algebra)

4. Indexing in P2P (KadoP)

5. Conclusion

Page 9: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

9

1. Introduction – the data ring

Page 10: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

10

P2P Data Management, 2006, S. Abiteboul 10/89

The information in a data ring

Data: tuples, collections, documents, relations…

Services: data sources, possibly some processing

Meta-data about resources: attribute/values pairs, annotations

Ontologies to explain data and metadata

View definitions

Data integration information, e.g., mappings between ontologies

Physical data: Indices and materialized views

Page 11: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

11

P2P Data Management, 2006, S. Abiteboul 11/89

Functionalities of the data ring

Storage, persistence, replication

Indexing, caching, querying, updating, optimization

Schema management, access control

Fault tolerance, self tuning, monitoring

Resource discovery, history, provenance, annotations, multi-linguism,

Semantic enrichment, uncertain data

Each functionality may be achieved by a peer or by the network

Page 12: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

12

P2P Data Management, 2006, S. Abiteboul 12/89

And now, what is a peer?

A mainframe database

A file system

Web server

A PC

A PDA

A telephone

A sensor

A home appliance

A car

A manufacturing tool

A telecom equipment

A toy

Another data ring

Any connected device or software with some information to share

A net address and some names of resources (e.g. document or service)

Page 13: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

13

P2P Data Management, 2006, S. Abiteboul 13/89

Advantages and disadvantages of P2P

Scaling

Performance• Optimization of parallelism• Avoid bottleneck• Replication

Availability • Replication

Cost• Avoid the cost of server• Share operational cost

Dynamicity• add/remove new data sources

Complexity

Performance• Cost for complex queries

• Communication cost

Availability• Peers can leave

Consistency maintenance • Difficult to support transaction

Quality• Difficult to guarantee quality

Page 14: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

14

P2P Data Management, 2006, S. Abiteboul 14/89

Crash course on Web standards

Data exchange format = XML • Labeled ordered trees

• Its main asset = XML schema

• There is much more

Distributed computing protocol = Web services• SOAP Simple Object Access Protocols

• WSDL Web Service Definition Language

• UDDI Universal Description, Discovery and Integration

• BPEL Business Process Execution Language

Query languages = XPath and XQuery• Declarative query language for XML + full-text + update language

Knowledge representation = Owl or RDF/S

SOAPWSDL

XML

XqueryXpath

OwlRDFS

Page 15: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

15

Information used to live in islands but with the Web, this is changing:

uniform access to information… …the dream for distributed data management

Page 16: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

16

P2P Data Management, 2006, S. Abiteboul 16/89

Do you like the standards?

It is the wrong question!

Correct questions: What can you do with it? What is missing?

Is Xquery the ultimate query language for the Web? No

• It is a language for querying centralized XML

• We will see what it is missing

We will not talk much about semantics

Page 17: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

17

P2P Data Management, 2006, S. Abiteboul 17/89

Automatic and distributed management of the data ring

No centralized server

No information administrator (no info manager)

Most users are non-experts• E.g., scientists

Requirements• Ease of deployment (zero-effort)

• Ease of administration (zero-effort)

• Ease of publication (epsilon-effort)

• Ease of exploitation (epsilon-effort)

• Participation in community building notably via annotations

Happy database admin

Page 18: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

18

P2P Data Management, 2006, S. Abiteboul 18/89

What should be made automatic

Self-statistics from the monitoring of the data ring• In particular, define the statistics that are needed*

Self-tuning based on the self-statistics• Choose the most appropriate organization

• Decide to install access structures: indexes, views, etc.

• Control replication of data and services

Self-healing• Recovery from errors

• E.g., replacement of a failing Web service

And automatic file management

Page 19: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

19

P2P Data Management, 2006, S. Abiteboul 19/89

Any hope?

Technology exists (database self-tuning, machine learning, etc.)

But self-tuning for databases has advanced very slowly

Why can this work?

1. There is no alternative (for db, this was just a cool gadget)

2. KISS (keep it simple stupid!)

3. The power of parallelism

This is assuming lots of machine have free cycles (true) and bandwidth is generous (not always true)

Page 20: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

20

P2P Data Management, 2006, S. Abiteboul 20/89

Distributed access control

Goal: Control access to ring resources

Access to resources is based on access rights (ACL)

Who is controlling ACLs?

A node manages ACLs for a collection of distributed resources• Easy but against the spirit and possible bottleneck

The network manages access control• Anybody can get the data

• The data is published with encryption and signatures; only nodes with proper access rights can perform reads/writes

• Some techniques exist

Page 21: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

21

P2P Data Management, 2006, S. Abiteboul 21/89

Monitoring

What is monitored?

Web service calls and database updates

The Web– Web pages– RSS feed

What is produced?

A stream of events– As a continuous service– As a RSS feed– As a Web site/page

Info-surveillance

Self-statistics and tracing

Basis for error diagnosis

Page 22: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

22

P2P Data Management, 2006, S. Abiteboul 22/89

Streams are everywhere

In query processing

In indexing (KadoP)

In recursive queries (AXML-QSQ)

In messaging, monitoring and pub/sub

That is why we will use an algebra over streams of trees and not simply an algebra over trees

Page 23: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

23

P2P Data Management, 2006, S. Abiteboul 23/89

Example: Edos distribution system

A system for the management of Linux distribution

Joint work with Mandriva Software and U. Tel Aviv

Community of open-source developers: thousands

System releases: about 10 000 software packages + metadata

Functionalities• Query the metadata

• Query subscription

• Retrieve packages

• Publish a new release or update an existing one

Page 24: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

24

P2P Data Management, 2006, S. Abiteboul 24/89

Exemple: WebContent

WebContent: an ANR platform for the management of web content

Web surveillance• Business, technical, web watching…

Participation of Gemo• WP3: knowledge

• WP5: P2P content management

Partners: CEA, EADS, Thales, Bongrain, Xyleme, Exalead, many research groups (UVSQ, Grenoble, Paris 6, etc.)

Page 25: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

25

P2P Data Management, 2006, S. Abiteboul 25/89

Taxonomy of such applications

Parameters• Number of peers and quantity of data

• How volatile the peers are

• The query/update workload

• The functionalities that are desired

Edos: peers and documents in thousands, mostly append for updates, peers not too volatile

An extreme: Google search engine in P2P for billions of documents using millions of hyper volatile peers

Mostly interested in the first case

Page 26: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

26

P2P Data Management, 2006, S. Abiteboul 26/89

Thesis

XQuery is fine for local XML processing and publishing

Not sufficient for distributed data management

The success of the relational model, i.e., of tables on a server :1. A logic for defining tables

2. An algebra for describing query plans over tables

By analogy, we need for trees in a P2P system1. A logic for defining distributed tree data and data services

2. An algebra for describing query plans over these

Proposal: ActiveXML logic and algebra

Page 27: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

27

P2P Data Management, 2006, S. Abiteboul 27/89

Outline

1. Introduction – the data ring

2. Calculus for P2P data management (ActiveXML)

3. Algebra for P2P data management (ActiveXML algebra)

4. Indexing in P2P (KadoP)

5. Conclusion

Page 28: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

28

2. Active XML:a logic for distributed data management

Page 29: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

29

P2P Data Management, 2006, S. Abiteboul 29/89

The basis

AXML is a declarative language for distributed information management and an infrastructure to support the language in a P2P framework

Simple idea: XML documents with embedded service calls

Intensional data• Some of the data is given explicitly whereas for some, its definition (i.e.

the means to acquire it when needed) is given

Dynamic data• If the data sources change, the same document will provide different

information

Page 30: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

30

P2P Data Management, 2006, S. Abiteboul 30/89

Example(omitting syntactic details)

<resorts state=‘Colorado’> <resort> <name> Aspen </name> <sc> Unisys.com/snow(“Aspen”) </sc> <depth unit=“meter”>1</depth> <hotels ID=AspHotels > …. Yahoo.com/GetHotels(<city name=“Aspen”/>) </hotels> </resort> …</resorts>

May contain callsto any SOAP web service :

• e-bay.net, google.com…to any AXML web services

• to be defined

Page 31: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

31

P2P Data Management, 2006, S. Abiteboul 31/89

Marketing Philosophy

Manon: What’s the capital of Brazil?Dad: Let’s ask Wikipedia.com!

Manon: How do I get a cheap ticket to Galapagos?Dad: Let’s place a subscription on LastMinute.com!

Manon: What are the countries in the EC?Dad: France, Germany, Holland, Belgium, and hum… Let’s ask YouLists.com for more!

Active answer = intensional and dynamic and flexible

Embedding calls in data is an old idea in database

Page 32: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

32

P2P Data Management, 2006, S. Abiteboul 32/89

Active XML peer

Peer-to-peer architecture

Each Active XML peer • Repository: manages Active XML data • Web client: calls the services inside a document

• Web server: provides (parameterized) queries/updates over the repository as web services

Exchange of AXML instead of XML

AXMLpeerso

ap

Page 33: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

33

P2P Data Management, 2006, S. Abiteboul 33/89

What is an AXML peer?

PC• Now open source ObjectWeb – queries in OQL

Peer on a mass storage system• eXist (open source XML database) queries in XQuery• Xyleme queries in XyQL

PDA or cell phone• Persistence in file system and XPATH

On going: the entire network• Data is stored in a P2P network - KadoP

More: java card, a relational database…

Page 34: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

34

P2P Data Management, 2006, S. Abiteboul 34/89

A key issue: call activation

When to activate the call?• Explicit pull mode: active databases

• Implicit pull mode: deductive databases

• Push mode: query subscription

What to do with its result?

How long is the returned data valid? • Mediation and caching

Where to find the arguments?• Under the service call: XML,XPATH or a service call

Page 35: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

35

P2P Data Management, 2006, S. Abiteboul 35/89

Hi John, what is the phone number of the Prime Minister of France?• Find his name at whoswho.com then look in the phone dir• Look in the yellow pages for deVillepin’s in phone dir of www.gov.fr• (33) 01 56 00 01

Another key issue: what to send?

Send some AXML tree t• As result of a query or as parameter of a call

The tree t contains calls, do we have to evaluate them? • If I do, I may introduce service calls, do we have to evaluate all

these calls before transmitting the data?

Page 36: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

36

P2P Data Management, 2006, S. Abiteboul 36/89

Active XMLcool idea – complex problems

Blasphemous claim:

Active XML is the proper paradigm for data exchange!

Not XML + not XQuery

Brings to a unique setting

distributed db, deductive db, active db, stream data

warehousing, mediation

This is unreasonable? Yes!

Plenty of works ahead… to make it work

But first, the algebra

Page 37: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

37

P2P Data Management, 2006, S. Abiteboul 37/89

Outline

1. Introduction – the data ring

2. Calculus for P2P data management (ActiveXML)

3. Algebra for P2P data management (ActiveXML algebra)• Query processing

• Query optimization

4. Indexing in P2P (KadoP)

5. Conclusion

Page 38: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

38

3. Active XML algebra

Page 39: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

39

P2P Data Management, 2006, S. Abiteboul 39/89

Motivation

Relational model: centralized tables

optimization: algebraic expression and rewriting

Active XML model: distributed trees

optimization: algebraic expression and rewriting

Distributed query optimization based on algebraic rewriting of Active XML trees

Based on experiences with AXML optimization

Page 40: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

40

P2P Data Management, 2006, S. Abiteboul 40/89

Active XML peers

We focus on positive AXML • Set-oriented data

• Positive/monotone services

Services = tree-pattern-query-with-join queries

Services produce streams • Optimized by a local query optimizer

• Evaluated by a local query processor

• Out of our scope

π

join

π

outputstream

inputstream

inputstream

Localqueryprocessing

Page 41: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

41

P2P Data Management, 2006, S. Abiteboul 41/89

The problem

An AXML system • A set of peers

• For each peer a set of documents and services

• Extensional data is distributed

• Intensional data (knowledge) is distributed

– Defined using query services (TPQJ queries)

– These services are generic: any peer can evaluate a query

A query q to some peer

Evaluate the answer to q with optimal response time

Page 42: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

42

P2P Data Management, 2006, S. Abiteboul 42/89

(AXML) algebraic expressions:

AXML algebra

AXML logic

l

E1 E2 En…

s@p

E1 E2 En…d@p

send@p

#n@p’ E1

eval@p

E1

receive@p

E1

Each such expression lives at some peerIncludes the AXML trees

Page 43: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

43

P2P Data Management, 2006, S. Abiteboul 43/89

Algebraic expressions annotations

Executing service call:

Terminated service call:

Subtletyq@p(5): definition of intensional data

eval(q@p(5)): request to evaluate it; during query optimization

q@p(5): query is being evaluated; during query processing

q@p(5): query evaluation is complete

Page 44: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

44

P2P Data Management, 2006, S. Abiteboul 44/89

Evaluation rules: local rules

for l ≠ sc, s ≠ send, receive

l

t1 t2 tn…

eval@p l

eval@p

t1

eval@p

t2

eval@p

tn…

tn

s@p

t1 t2 …

eval@p ●s@p

eval@p

t1

eval@p

t2

eval@p

tn…

→ E1

eval@p

eval@p E1

eval@p→

Page 45: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

45

P2P Data Management, 2006, S. Abiteboul 45/89

Evaluation rules: transfer rules

Site p asks p’ to do the work and send the result to p

s@p’

t1 t2 …

eval@p

#x@p

s@p’

t1 t2 …

receive@p

#x@p

s@p’

t1 t2 …

eval@p’

newRoot()@p’

send@p’

#x@p

→ &

Page 46: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

46

P2P Data Management, 2006, S. Abiteboul 46/89

Evaluation rules: more transfer rules

When a query is evaluated, results appear

They are sent to the place that requested them

Also some rules for eof

→ &

#x@p

Z

send@p’

eval@p’

●s@p’

#x@p

Z

send@p’

eval@p’

●s@p’

#x@p

Z

Page 47: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

47

P2P Data Management, 2006, S. Abiteboul 47/89

Evaluation

Reminder: setting• An AXML system

• A request to evaluate query q at peer p – eval@p( q )

Rewrite the trees in peer workspaces until termination of the process

Results

For positive XML, this process converges… to a possibly infinite state

This process computes the answer to q

May be fairly inefficient: need for optimization!

Page 48: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

48

Optimization

More rewrite rules to evaluate a query more efficiently

Page 49: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

49

P2P Data Management, 2006, S. Abiteboul 49/89

Query optimization

Well-known optimization techniques for distributed data management

Pushing selections

Semijoin reducers

Horizontal, vertical, hybrid decomposition

Recursive query processing and query-subquery

Some “specific” AXML optimizations

Pushing queries over service calls

Lazy service call evaluation …

Optimizing subscription management

All are captured by the algebraic framework

Page 50: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

50

P2P Data Management, 2006, S. Abiteboul 50/89

Example: pushing selections

Same rule applies if d@p2 is replaced by a continuous query

q

eval@p

d@p2

q1@p

eval@p

@p2(q2@p2(d@p2))

→ q1

eval@p

(q2(d@p2))

→ q1@p

eval@p

eval@p2

(q2(d))

Suppose q ≡ q1((q2)):

Page 51: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

51

P2P Data Management, 2006, S. Abiteboul 51/89

Example: interleaving of processing and optimization

At peer i: di = ri di+1

Query at p1: (d1)

1. (d1) → (r1) (d2)eval@p1((r1) (d2)) → eval@p1((r1)) eval@p1((d2))eval@p1((r1)) → ●(r1) (starts streaming data)

2. (d2) → (r2) (d3); (r2) starts streaming data

3. (d3) → (r3) (d4) …

(r1) (r2) (r3) (r4)

Page 52: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

52

P2P Data Management, 2006, S. Abiteboul 52/89

Transfer and load balancing rules

Peer p1 delegates the evaluation of E to p2

E

eval@p1

#x@p1

eval@p2

send@p2

#x@p1

newRoot@p2()

eval@p1

#x@p1

send@p1

E

Page 53: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

53

P2P Data Management, 2006, S. Abiteboul 53/89

Transfer and load balancing rules

Peer p1 delegates the evaluation of E to p2

eval(E)

#x@p1

send@p2

#x@p1

newRoot@p2()

eval@p1

#x@p1

send@p1

eval@p2

E

Page 54: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

54

P2P Data Management, 2006, S. Abiteboul 54/89

Transfer and load balancing rules

Peer p1 delegates the evaluation of E to p2

eval(E)

#x@p1

newRoot@p2()#x@p1

send@p2

#x@p1 eval@p2

E

&

Page 55: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

55

P2P Data Management, 2006, S. Abiteboul 55/89

Transfer and load balancing rules

Peer p1 delegates the evaluation of E to p2

eval(E)

#x@p1

newRoot@p2()#x@p1

send@p2

#x@p1eval(E)

&

Page 56: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

56

P2P Data Management, 2006, S. Abiteboul 56/89

Transfer and load balancing rules

Peer p1 delegates the evaluation of E to p2

#x@p1

eval(E) →

newRoot@p2()#x@p1

send@p2

#x@p1eval(E)

&eval(E)

Page 57: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

57

P2P Data Management, 2006, S. Abiteboul 57/89

Transfer and load balancing rules

Peer p1 delegates the evaluation of E to p2

E

eval@p1

#x@p1

eval@p2

send@p2

#x@p1

newRoot@p2()

eval@p1

#x@p1

send@p1

E

Page 58: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

58

P2P Data Management, 2006, S. Abiteboul 58/89

Back to interleaved execution and optimization

Data transfers reducedMore work for p1: merging all the streams

(r1) (r2) (r3) (r4)

(r1) (r2) (r3) (r4) …

Hierarchical stream merging

… …

Repeated transfers

Page 59: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

59

P2P Data Management, 2006, S. Abiteboul 59/89

Example: Horizontal and vertical decomposition

A relation d over ABC that is split both horizontally and vertically d = (d1 d2) d3

d1 = B<5 (d') and d2 = B>=5 (d')

d', d1, d2 over AB and d3 over BC; each di is at a peer pi

Consider the query B=0@p(d) → B=0@p( (B<5 (d') B>=5 (d'))) d3@p3 )→ B=0 @p( d1@p1 d3@p3 )→ @p (#x@preceive(d1@p1), #y@preceive(d3@p3)) & send@p1(#x@p; B=0@p1(d1@p1) ) & send@p3(#y@p; d3@p3)

Page 60: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

60

P2P Data Management, 2006, S. Abiteboul 60/89

Common sub-expression elimination

eval@p(E), #x@preceive@p(E) →

eval@p(#x@p), #x@preceive@p(E)

eval@p

E

#x@p

E

receive@p

eval@p #x@p

E

receive@p#x@p→

& &

Page 61: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

61

P2P Data Management, 2006, S. Abiteboul 61/89

In spite of the two calls to s@p, the function is evaluated only once

Common sub-expression elimination

eval@p’

q@p’

s@p s@p

q@p’

#x@p’

s@preceive@p’

s@p

eval@p’

newRoot()@p

send@p

#x@p’ s@p

→ &

newRoot()@p’

send@p’

#y@p’ #x@p’

q@p’

#x@p’

receive@p’

s@p

#y@p’

receive@p’

s@p

newRoot()@p

send@p

#x@p’ s@p

→& &

Page 62: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

62

P2P Data Management, 2006, S. Abiteboul 62/89

Example: recursive query processing

Using a pseudo Datalog syntax

s1@p($x, $y) d2@p'($x, $z), s2@p'($z, $y)

s2@p'($x, $y) d1@p($x, $z), s1@p($y, $z)

After rewriting:

on p : #x@p receive@p(q1@p'(d2@p', s2@p') )

root@p send@p(#y@p', q2@p( d1@p, #x@p) )

on p' : root@p' send@p'(#x@p,

q1@p'(d2@p', #y@p' receive@p'(s2@p') ) )

Page 63: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

63

P2P Data Management, 2006, S. Abiteboul 63/89

q@any: where q is a query• Any peer that has some query processor for q can do it

f@any: where f is a processing service call• Example: decryption or gene comparison

q over a P2P collection

Generic and global services

eval@p

q @ coll

eval@p

q @→ index

q

eval@p

q@p1→eval@p

q@p2

Page 64: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

64

P2P Data Management, 2006, S. Abiteboul 64/89

The AXML algebra – conclusion

Captures distributed XML query processing/optimization

Based on a communication model a la CCS

Algebraic – stream-oriented

Orthogonal to the local XML query optimizer

Orthogonal to the network support (DHT, small world etc.)

What is not yet available? A cost model and heuristics

Page 65: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

65

P2P Data Management, 2006, S. Abiteboul 65/89

Outline

1. Introduction – the data ring

2. Calculus for P2P data management (ActiveXML)

3. Algebra for P2P data management (ActiveXML algebra)

4. Indexing in P2P (KadoP)

5. Conclusion

Page 66: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

66

4. P2P XML indexing and query processing

Page 67: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

67

P2P Data Management, 2006, S. Abiteboul 67/89

Efficient evaluation of tree-pattern-queries

Many optimization techniques

We are interested here in distributed query evaluation/optimization

1) We consider XML indexing

2) Holistic twig join that is based on indexing

3) P2P indexing

4) P2P query processing

5) Optimizing P2P indexing

Page 68: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

68

P2P Data Management, 2006, S. Abiteboul 68/89

XML indexing: structural identifiers

1

2

3 5

6

7

84

6

6

6

8

8

8

X ancestor of Y <=>pre(X) < pre(Y) andpost(X) ≥ post(Y)

X parent of Y <=>X ancestor of Y andlevel(X) = level(Y) - 1Structural IDs = Prefix-Postfix

A

B

D E

C

F

“John” G

0

1 1

22 2

3

443

-Level

Page 69: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

69

P2P Data Management, 2006, S. Abiteboul 69/89

Holistic Twig Join

Input a document and a tree pattern query

Find the bindings of the query in the document

Holistic = holistique

(le tout et pas juste les parties)

Twig = brindille

Join = you know…

Sounds like Harry Potter?

Page 70: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

70

P2P Data Management, 2006, S. Abiteboul 70/89

Query evaluation over a document

Ids for A(1,8,0)…

Ids for D

Ids for “John”

“John”

Ids for C

Ids are sorted in lexicographical orderGoals is to find “matching Ids”

A

DC

Page 71: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

71

P2P Data Management, 2006, S. Abiteboul 71/89

The Holistic Twig Join Algorithma

b

c

a (3,5

c (4,5)

a (5,5)

c (6,8)

c (7,8)

a (8,8)

b (10,11)

c (11,11)

b (12,14)

b (13,14)

c (14,14)

a (16,17)

c (17,17)

b (19,22)

b (20,21)

c (22,22)

c (23,25)

b (24,25)

c (25,25)

a (2,8) c (9,14) a (15,17) a (18,25)

c (21,21)

r (1,25)0

1

2

3

4

level

Page 72: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

72

P2P Data Management, 2006, S. Abiteboul 72/89

Legend:

c11c9

a

c8c7

a6

b3

c6c4

The Holistic Twig Join Algorithm

a1

a2

a3

a4

a5 a7

b1 b2 b4

b5

c1 c2

c3

c5c9

b

c

(a7, b4, c8), (a7, b5, c8),

Sc

Sb

Sa

a7

b4

c8

b6

b5

c10

(a7, b4 ,c9)

b6

c11

(a7 ,b6 ,c11)

Head of the streamFind the match for the query sub-tree determined by this node !!!The ID is present also in the stack

Stacks

This is the end

Page 73: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

73

P2P XML processing

Page 74: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

74

P2P Data Management, 2006, S. Abiteboul 74/89

XML indexing in Xyleme

History• 1999: INRIA research project • 2000: Creation of a spin-off• 2006: About 25 people

Technology• A scalable XML repository• A content warehouse • On a cluster of Linux PC

XML query processing • Twig join• Index is distributed• Keyword-based vs. document

based

LAN

Put(C;[d,p,6,6,1])Put(“John”;[d,p,3,1,2])

hash(C)

hash(“John)

Page 75: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

75

P2P Data Management, 2006, S. Abiteboul 75/89

Query processing over a distributed collection

Ids for A(p12,d456, 1,7,0)…

Ids for D

Ids for John

A

DC

“John”

Ids for C

Ids include peerId and docIdIds are sorted in lexicographical orderGoals is to find “matching Ids” in the collection

Page 76: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

76

P2P Data Management, 2006, S. Abiteboul 76/89

XML indexing in KadoP

Use structural Ids

Publish them via a DHT• Distributed Hash Table

• Peers come and go

• Locate(k):: log(n) messages to fing the peer in charge of key k

• Put(k,v)

• Get(k): retrieves all the values for k

We use Pastry • We also tried P2PSim and JXTA

DHT

put(C;[d,p,6,6,1])

put(“John”;[d,p,3,1,2])

hash(C)

hash(“John)

posting for C

put(C;[d,p,6,6,1])

Page 77: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

77

P2P Data Management, 2006, S. Abiteboul 77/89

XML query processing in KadoP

Given a tree pattern query Q

Evaluate an index query indexQ to locate the peers that can provide some answers

• indexQ is a twig join

Ship Q to these peers and evaluate it there

If indexQ is imprecise, many false positive• Example: ship Q to all peers (maximal parallelism)

• Example: Instead of structural Ids, just use (label/word,peerId,docId)

Page 78: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

78

P2P Data Management, 2006, S. Abiteboul 78/89

KadoP architecture

KadoP peerpublish & query

KadoP Engine

DHT locate, put, get & delete Index

Indexing

Queryprocessing

ActiveXMLengine

Web interface

Semantic layerExternalLayer

LogicalLayer

PhysicalLayer

Page 79: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

79

P2P Data Management, 2006, S. Abiteboul 79/89

Some technical issues

Our goal: manage millions of documents with a large number of peers

First experiments were a disaster

Replace the index storage of the DHT in a FS by storage in a database (Berkeley DB)

Extend the API of the DHT to have Append and not only Read/Write

Extend the API of the DHT to have a streaming exchange of postings• Useful because the XML algebra works better with streams

Now it scales but there is the issue of long postings

Page 80: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

80

P2P Data Management, 2006, S. Abiteboul 80/89

The issue of long postings: Google in P2P

Using keyword distribution

Suppose • Peer for Ullman is in Europe• Peer for XML is in US

we have to ship one long posting between US and Europe

For a large number of users, we absorb all the bandwidth of Internet backbone

Need for replication

Even for thousands of peers, the exchange of long postings is an issue

DHTUllmanxml

Ullmann & xml?

Page 81: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

81

P2P Data Management, 2006, S. Abiteboul 81/89

Intensional indexing in KadoP

long postingh(Name)

Distributed B-tree

Long posting = bad response time

1. No long posting

2. get h(name) then parallel fetch

3. Possibility to optimize further

f(docId55..docId75)

may be it does not match

no need to call f

h(Name)

f g h i

Page 82: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

82

P2P Data Management, 2006, S. Abiteboul 82/89

More optimization

• Standard for P2P keyword search– Gap compression and adaptive set intersection

• Standard distributed query optimization techniques– Ship smallest list – Load balancing– Caching– Replication

• Semi-join techniques notably Bloom semi-join

Page 83: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

83

P2P Data Management, 2006, S. Abiteboul 83/89

Outline

1. Introduction – the data ring

2. Calculus for P2P data management (ActiveXML)

3. Algebra for P2P data management (ActiveXML algebra)

4. Indexing in P2P (KadoP)

5. Conclusion

Page 84: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

84

6. Conclusion

Page 85: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

85

P2P Data Management, 2006, S. Abiteboul 85/89

Logic for distributed data management• Opinion: XQuery is a language for local XML management

• Proposal: ActiveXML

Algebraic foundation of distributed query optimization• Proposal: ActiveXML algebra

P2P (Active) XML indexing• KadoP is now being tested and we are working on optimization

Software• ActiveXML is open-source – see activexml.net

• KadoP soon will be – already available upon request

• EDOS distribution system as well

Conclusion

Page 86: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

86

P2P Data Management, 2006, S. Abiteboul 86/89

Lots of related work and related systems

This is going very fast in system devepments

Structured P2P nets: Pastry, Chord

Content delivery net: Coral, Akamai

XML repositories: Xyleme, DBMonet

Multicas systemst: Avalanche, Bullet

File sharing systems: BitTorrent, Kazaa

Pub/Sub systems: Scribe, Hyper

Distributed storage systems: OceanStore, GoogleFS

Etc.

Fundamental research is somewhat left behind

Page 87: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

87

P2P Data Management, 2006, S. Abiteboul 87/89

Issues

P2P query optimization

P2P access control

P2P archiving

P2P self tuning

P2P monitoring

P2P knowledge management [SomeWhere]

Also: analysis and verification of these systems• E.g., termination, error detection, diagnosis

Page 88: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

88

P2P Data Management, 2006, S. Abiteboul 88/89

Find your own topic

Pick your favorite problem for data or knowledge management and study it in a P2P setting

with gigabytes of data and thousands of peers

If you find it boring, consider it

with terabytes of data and millions of peers

Page 89: 1 P2P Data Management, 2006, S. Abiteboul1/89 Information Management in P2P Serge Abiteboul INRIA-Futurs and Univ. Paris 11

89

P2P Data Management, 2006, S. Abiteboul 89/89

Merci

Merci