
MDP2P – S. Abiteboul - 2006

KadoP: a P2P content sharing system

Serge Abiteboul
INRIA-Futurs (Orsay) and University Paris Sud


Context

MDP2P: project "Masse de Données en P2P" (massive data in P2P)

KadoP: joint work with Ioana Manolescu and Nicoleta Preda, INRIA-Futurs (Orsay) and University Paris Sud (thesis of Nicoleta Preda)

Article in EDBT and demo in Data Engineering


Organization

Introduction

The basis: XML, DHT, ActiveXML

KadoP

Query processing

The implementation

Conclusion


Introduction


Peer-to-peer

A large and varying number of computers cooperate to solve some particular task without any centralized authority

Goal: build an efficient, robust, scalable system based (typically) on inexpensive, unreliable computers distributed in a wide area network

Examples
• seti@home: search for extraterrestrial intelligence
• kazaa: obtain free music/video over the net
• cabal: decryption of a 512-bit RSA code
• grub: P2P Web search


Data management in P2P

Publication of resources (XML and knowledge)

Storage of resources

Access to resources

Acquisition/Enrichment/Exploitation

Focus here on query processing

Precise answers taking into account the text, the structure and the semantics of XML documents.

[Figure: peers connected through the Internet]


The basis: Standards + ActiveXML + DHT


Standards of distributed data management

Standard for data exchange: XML
• Extensible Markup Language
• Labeled ordered trees

Standard for query languages
• XPath, XQuery

Standards for distributed computing
• Web services: SOAP, WSDL

ActiveXML = XML documents with embedded Web service calls
• Intensional
• Dynamic

[Figure: the standards involved: XML, XPath, XQuery, SOAP, WSDL]


ActiveXML = XML + embedded service calls (omitting syntactic details)

<resorts state="Colorado">
  <resort>
    <name> Aspen </name>
    <scond> Unisys.com/snow("Aspen") </scond>
    <hotels ID=AspHotels> … Yahoo.com/GetHotels(<city name="Aspen"/>) </hotels>
  </resort>
  …
</resorts>

May contain calls
• to any SOAP web service
• to any ActiveXML web service

Example of a call result: <depth unit="meter">1</depth>
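The idea of an embedded call being replaced by its result can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `<call service="...">` element, the `SERVICES` registry and the `materialize` function are hypothetical names, the "service" is a local function rather than a SOAP call, and real ActiveXML syntax and call semantics are richer than this.

```python
import xml.etree.ElementTree as ET

# Hypothetical local functions standing in for remote Web services.
SERVICES = {
    "snow": lambda resort: ET.fromstring('<depth unit="meter">1</depth>'),
}

# Hypothetical document: the <call> element marks an embedded service call.
DOC = """
<resorts state="Colorado">
  <resort>
    <name>Aspen</name>
    <scond><call service="snow">Aspen</call></scond>
  </resort>
</resorts>
"""

def materialize(root):
    """Replace every <call> element by the result of the service it names."""
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag == "call":
                result = SERVICES[child.get("service")](child.text)
                result.tail = child.tail  # keep surrounding whitespace
                parent[index] = result
    return root

root = materialize(ET.fromstring(DOC))
print(ET.tostring(root, encoding="unicode"))
```

After `materialize`, the `<scond>` element holds a `<depth>` element instead of the call, which is the intensional-to-extensional step the slide hints at.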


ActiveXML peer

Each ActiveXML peer
• Repository
• Web client
• Web server

Open-source in ObjectWeb, see http://ActiveXML.net

Based on standard libraries
• SUN's Java SDK 1.4 (XML parser, XPath processor, XSLT engine)
• Apache Tomcat 4.0 servlet engine, Apache Axis SOAP toolkit 1.0
• X-OQL query processor (soon to be replaced by the eXist XML database?)

[Figure: ActiveXML peers communicating through SOAP]


Distributed hash tables

locate(k)

put(k,v1): hash(k) determines the peer P_h(k) where (k,v1) is kept

get(k) retrieves v1, v2, … from P_h(k)

delete(k,v)

Management of the overlay network is complex because peers come and go

We use Pastry

We have tried others: Chord, JXTA

[Figure: put(k,v1) and put(k,v2) store the entry k: v1,v2 in the DHT; get(k) returns v1,v2]
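The four operations above can be illustrated with a toy, in-memory stand-in for a DHT. This is a minimal sketch, not how Pastry works: the class name `ToyDHT`, the fixed set of peers and the modulo placement are all assumptions made for the example, and there is no routing, replication or handling of peers that come and go.

```python
import hashlib

class ToyDHT:
    """In-memory stand-in for a DHT: fixed peers, no routing, no churn."""

    def __init__(self, peer_names):
        self.names = sorted(peer_names)
        self.peers = {name: {} for name in self.names}

    def locate(self, key):
        """Return the peer in charge of key, chosen from hash(key)."""
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        return self.names[int.from_bytes(digest, "big") % len(self.names)]

    def put(self, key, value):
        """Add value to the list kept for key on the responsible peer."""
        self.peers[self.locate(key)].setdefault(key, []).append(value)

    def get(self, key):
        """Return all values stored for key."""
        return list(self.peers[self.locate(key)].get(key, []))

    def delete(self, key, value):
        """Remove one value from the list kept for key, if present."""
        values = self.peers[self.locate(key)].get(key, [])
        if value in values:
            values.remove(value)

dht = ToyDHT(["peer-a", "peer-b", "peer-c"])
dht.put("k", "v1")
dht.put("k", "v2")
print(dht.locate("k"), dht.get("k"))
```

Both `put` calls land on the same peer because the placement depends only on the key, which is why `get("k")` can return both values from a single location.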


KadoP: a P2P content sharing system


KadoP data items

XML documents and web services
• XML sub-trees, views and collections of documents
• Labels, words and stemming of these words

Types
• DTD and XSD for documents, WSDL for services

Also ActiveXML documents and ActiveXML services

Ontologies
• Concepts, isa, etc.


Goal

Find relevant information to answer a query

May require some extensional information

May require calling some Web services

May require some elaborate query plan including service composition

Simple examples
• Find me Emacs packages
• Find me Emacs packages that were modified last week
• Find me the packages depending on Emacs in my Linux system


KadoP architecture

[Figure: a KadoP peer (publish & query). Components: Web interface, semantic layer, KadoP engine (indexing, query processing), ActiveXML engine, DHT (locate, put, get & delete), index. Layers: external, logical, physical.]


Architecture

Java/JSP application on each peer

KadoP: Distributed index

EDOS distribution system

ActiveXML: Data/metadata storage

IDiP: dissemination management

BitTorrent: efficient download

[Figure: EDOS peers communicating through web services, over index, data and metadata. Publisher: EDOS API, KadoP, ActiveXML, IDiP, BitTorrent. Client: EDOS API, ActiveXML (marked *), IDiP, BitTorrent. Replicator: EDOS API, KadoP, ActiveXML, IDiP, BitTorrent. One element is labeled "optional".]


P2P XML Query processing


Efficient evaluation of tree-pattern-queries

Many optimization techniques

We are interested here in distributed query evaluation/optimization

1) We consider XML indexing

2) Holistic twig join that is based on indexing

3) P2P indexing

4) P2P query processing

5) Optimizing P2P indexing


XML indexing: structural identifiers

[Figure: example tree with nodes A, B, C, D, E, F, G and the text leaf "John", each labeled with its structural ID]

Structural IDs = Prefix-Postfix-Level

X ancestor of Y <=> pre(X) < pre(Y) and post(X) ≥ post(Y)

X parent of Y <=> X ancestor of Y and level(X) = level(Y) - 1
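The two tests above are easy to try out in code. The sketch below is an illustration under assumptions, not the KadoP implementation: `pre` is taken to be the preorder rank starting at 1 and `post` the largest `pre` found in the node's subtree (a choice that gives the root of an 8-node tree the ID (1,8,0), matching the ID quoted for A on the query-evaluation slide). The tree shape, the `Node` class and the function names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)
    sid: tuple = None  # (pre, post, level), filled in by assign_ids

def assign_ids(node, counter=None, level=0):
    """Label every node with (pre, post, level) in one depth-first pass."""
    if counter is None:
        counter = [0]
    counter[0] += 1
    pre = counter[0]
    for child in node.children:
        assign_ids(child, counter, level + 1)
    # counter[0] is now the largest pre assigned inside this subtree
    node.sid = (pre, counter[0], level)
    return node

def is_ancestor(x, y):
    """pre(X) < pre(Y) and post(X) >= post(Y)."""
    return x.sid[0] < y.sid[0] and x.sid[1] >= y.sid[1]

def is_parent(x, y):
    """X ancestor of Y and level(X) = level(Y) - 1."""
    return is_ancestor(x, y) and x.sid[2] == y.sid[2] - 1

# Hypothetical tree: A(B(D, E), C(F("John"), G))
john, d, e, g = Node("John"), Node("D"), Node("E"), Node("G")
f = Node("F", [john])
b, c = Node("B", [d, e]), Node("C", [f, g])
a = assign_ids(Node("A", [b, c]))

print(a.sid, c.sid, john.sid)
print(is_ancestor(c, john), is_parent(c, john), is_parent(f, john))
```

The point of the encoding is that both tests only compare integers from the two IDs, so they can be evaluated on index entries without touching the documents.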


Holistic Twig Join

Input: a document and a tree pattern query

Find the bindings of the query in the document

Holistic = considering the whole and not just the parts

Twig = a small branch

Join = the database join operation

Sounds like Harry Potter?


Query evaluation over a document

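One stream of IDs per query node
• IDs for A: (1,8,0), …
• IDs for D
• IDs for C
• IDs for "John"

IDs are sorted in lexicographical order

The goal is to find "matching IDs"

[Figure: tree pattern query over the nodes A, D, C and "John"]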


The Holistic Twig Join Algorithm

Query pattern: a, b, c

Document (each node with its ID; indentation shows the level, from 0 to 4):

r (1,25)
  a (2,8)
    a (3,5)
      c (4,5)
        a (5,5)
    c (6,8)
      c (7,8)
        a (8,8)
  c (9,14)
    b (10,11)
      c (11,11)
    b (12,14)
      b (13,14)
        c (14,14)
  a (15,17)
    a (16,17)
      c (17,17)
  a (18,25)
    b (19,22)
      b (20,21)
        c (21,21)
      c (22,22)
    c (23,25)
      b (24,25)
        c (25,25)


The Holistic Twig Join Algorithm

[Figure: animation of the algorithm over the streams a1…a7, b1…b6 and c1…c11, with one stack per query node (Sa, Sb, Sc). Legend: head of the stream; node for which the match of the query sub-tree is being sought; ID also present in the stack.]

Matches found: (a7, b4, c8), (a7, b5, c8), (a7, b4, c9), (a7, b6, c11)

This is the end
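The stream-and-stack idea can be sketched for the simplest case, a linear path pattern. The code below is an illustration under assumptions: it reads the query as a//b//c (ancestor-descendant edges, which is consistent with the four matches listed), it assumes the query labels are distinct, and it follows the path-only variant usually called PathStack in the literature rather than the full twig algorithm, which also handles branching patterns. The function name `path_stack` and the data layout are hypothetical; the IDs are the (start, end) pairs of the example document.

```python
def path_stack(streams):
    """Match q0//q1//...//qn over one sorted ID stream per query node.

    streams[i] is the list of (start, end) IDs for query node i, sorted
    by start. Returns the matches as tuples of IDs, root first.
    """
    n = len(streams)
    pos = [0] * n                    # cursor in each stream
    stacks = [[] for _ in range(n)]  # entries: (id, index of top of parent stack)
    results = []

    def expand(level, upto, suffix):
        """Combine suffix with every chain of ancestors recorded in the stacks."""
        if level < 0:
            results.append(suffix)
            return
        for i in range(upto + 1):
            node, ptr = stacks[level][i]
            expand(level - 1, ptr, (node,) + suffix)

    while pos[-1] < len(streams[-1]):  # stop when the leaf stream is exhausted
        # Query node whose stream head has the smallest start position.
        q = min((i for i in range(n) if pos[i] < len(streams[i])),
                key=lambda i: streams[i][pos[i]][0])
        node = streams[q][pos[q]]
        pos[q] += 1
        # Pop stack entries that end before this node starts: they can no
        # longer be ancestors of this node or of any later one.
        for stack in stacks:
            while stack and stack[-1][0][1] < node[0]:
                stack.pop()
        if q > 0 and not stacks[q - 1]:
            continue  # no candidate ancestor: this node cannot contribute
        ptr = len(stacks[q - 1]) - 1 if q > 0 else -1
        if q == n - 1:
            expand(q - 1, ptr, (node,))  # leaf: output the matches it closes
        else:
            stacks[q].append((node, ptr))
    return results

# Streams read off the example document, one per query node.
A = [(2, 8), (3, 5), (5, 5), (8, 8), (15, 17), (16, 17), (18, 25)]
B = [(10, 11), (12, 14), (13, 14), (19, 22), (20, 21), (24, 25)]
C = [(4, 5), (6, 8), (7, 8), (9, 14), (11, 11), (14, 14), (17, 17),
     (21, 21), (22, 22), (23, 25), (25, 25)]

matches = path_stack([A, B, C])
for match in matches:
    print(match)
```

On this data the sketch returns four matches, all rooted at a (18,25), which correspond to (a7, b4, c8), (a7, b5, c8), (a7, b4, c9) and (a7, b6, c11) when the IDs are numbered by position in their stream. Each stream is read once, and only nodes that may still have descendants to come are kept on the stacks.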


Also: Intensional data

Example: include and references

Example: function calls in ActiveXML

Find me the packages depending on Emacs in my Linux system

package (name, author, size, signature, dependsOn(self)…)

the depending packages are intensional

Naïve: return empty answer

Brutal: return all documents with a function call

What we do: use indexing (and typing)


The implementation


Some technical issues

Common belief: this cannot work because of transfer delays
• Indeed, first experiments were a disaster
• DHT did not scale – not designed for so many entries
• Transfers of long posting lists were killing the system

Our target: make it work in some modest setting

with millions of documents

with thousands of peers

with not too volatile peers

(not Kazaa or GoogleSearch but industrial application)


Let’s make it work

Some of the early observations of MDP2P and solutions
• Replace the index storage of the DHT in a file system by storage in a database (Berkeley DB)
• Extend the API of the DHT to have Append and not only Read/Write
• Extend the API of the DHT to have a streaming exchange of postings (for long postings)
  – Useful because the XML algebra works better with streams

Now KadoP scales but can be optimized

We will see here one optimization technique: DPP
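The two API extensions mentioned above, Append and streaming, can be illustrated on a toy posting store. This is a sketch under assumptions: the class `PostingStore`, its method names and the chunk size are hypothetical, the store is a plain dictionary rather than Berkeley DB, and nothing here crosses a network.

```python
class PostingStore:
    """Hypothetical per-peer store mapping a key to a posting list."""

    def __init__(self):
        self.data = {}

    def put(self, key, postings):
        """Write: replaces the whole list."""
        self.data[key] = list(postings)

    def get(self, key):
        """Read: returns the whole list in one piece."""
        return list(self.data.get(key, []))

    def append(self, key, posting):
        """Append: adds one posting without reading the list back first."""
        self.data.setdefault(key, []).append(posting)

    def stream(self, key, chunk_size=2):
        """Yield the posting list chunk by chunk instead of in one message."""
        postings = self.data.get(key, [])
        for i in range(0, len(postings), chunk_size):
            yield postings[i:i + chunk_size]

store = PostingStore()
for doc_id in (1, 2, 3):
    store.append("name", doc_id)

for chunk in store.stream("name"):
    print(chunk)
```

With only Read/Write, publishing one more posting means fetching the list, extending it and writing it back; `append` avoids that round trip, and `stream` lets a consumer start working on the first chunk before the last one has been sent.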


Distributed Posting Partitioning (DPP)

Long posting = bad response time

1. No long posting

2. get h(name), then parallel fetch

3. Possibility to optimize further
• Partition f covers docId55..docId75: if that range cannot match, there is no need to call f

[Figure: a single long posting stored at h(Name), versus h(Name) pointing to the partitions f, g, h, i, organized as a distributed B-tree]
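The three points above can be sketched as follows: split a long posting list into blocks, keep a small directory of block ranges under the key, then fetch in parallel only the blocks whose range can match. This is an illustration under assumptions and not the KadoP implementation: postings are plain document IDs, the block size, the function names and the block keys are hypothetical, and a dictionary lookup stands in for a remote get.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 3  # hypothetical maximum number of postings per block

def partition(postings, block_size=BLOCK_SIZE):
    """Split a sorted posting list into blocks.

    Returns (directory, blocks): directory lists one
    (first_doc_id, last_doc_id, block_key) entry per block.
    """
    blocks = {}
    directory = []
    for i in range(0, len(postings), block_size):
        chunk = postings[i:i + block_size]
        key = f"block-{i // block_size}"
        blocks[key] = chunk
        directory.append((chunk[0], chunk[-1], key))
    return directory, blocks

def fetch(directory, blocks, lo, hi):
    """Fetch in parallel only the blocks whose range overlaps [lo, hi]."""
    wanted = [key for first, last, key in directory
              if first <= hi and last >= lo]
    with ThreadPoolExecutor() as pool:
        # blocks.__getitem__ stands in for a remote call to another peer
        chunks = list(pool.map(blocks.__getitem__, wanted))
    hits = [doc for chunk in chunks for doc in chunk if lo <= doc <= hi]
    return wanted, hits

postings = [3, 8, 21, 55, 60, 75, 80, 91]
directory, blocks = partition(postings)
wanted, hits = fetch(directory, blocks, 50, 78)
print(directory)
print(wanted, hits)
```

Here only one of the three blocks is contacted for the range 50..78; the other two are skipped on the basis of the directory alone, which is the "no need to call f" case.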


Performance


Main issues

Scaling: optimize query processing
• Adapting Bloom filters and other known techniques
• Ongoing in Gemo

Scaling: main tool is replication
• Issues are consistency and overhead
• Ongoing work in MDP2P/Atlas

Dynamicity: better manage peers entering/leaving the system


A KadoP application: Data management in Edos

The distribution of a large software system among the peers developing it

Mandriva Linux distribution: 10,000 packages + metadata shared among up to 1,000 peers

Thousands of packages (about 9000 in Mandriva)
• Package metadata in XML
• And why not: bug reports, annotations, emails, etc.

Goal: distribute & query & monitor & getPackage

Technology: ActiveXML + KadoP + IDiP + BitTorrent


Conclusion

V1 of KadoP and EdosDistribution are running
• Open-source

Management of XML resources in P2P
• Management of semantics and web services as well
• Based on active data (ActiveXML) and DHT (FreePastry)

Novelties in KadoP
• Management of data and knowledge in P2P
• Use of intensional information
• Original optimization techniques

Future work
• ANR platform for content management: webContent


Thank you