MDP2P – S. Abiteboul - 2006
KadoP: a P2P content sharing system
Serge Abiteboul, INRIA-Futurs (Orsay) and University Paris Sud
Context
MDP2P: project “Masse de Données en P2P” (massive data in P2P)
KadoP: Joint work with Ioana Manolescu and Nicoleta Preda, INRIA-Futurs (Orsay) and University Paris Sud (thesis of Nicoleta)
Article in EDBT and demo in DataEngineering
Organization
Introduction
The basis: XML, DHT, ActiveXML
KadoP
Query processing
The implementation
Conclusion
Introduction
Peer-to-peer
A large and varying number of computers cooperate to solve some particular task without any centralized authority
Goal: build an efficient, robust, scalable system based (typically) on inexpensive, unreliable computers distributed in a wide area network
Examples
• seti@home: search for extraterrestrial intelligence
• kazaa: obtain free music/video over the net
• cabal: decryption of a 512-bit RSA code
• grub: P2P Web search
Data management in P2P
Publication of resources (XML and knowledge)
Storage of resources
Access to resources
Acquisition/Enrichment/Exploitation
Focus here on query processing
Precise answers taking into account the text, the structure and the semantics of XML documents.
[Figure: peers connected through the Internet]
The basis: Standards + ActiveXML + DHT
Standards of distributed data management
Standard for data exchange: XML
• Extensible Markup Language
• Labeled ordered trees
Standard for query languages
• XPath, XQuery
Standards for distributed computing
• Web services: SOAP, WSDL
ActiveXML = XML documents with embedded Web service calls
• Intensional
• Dynamic
[Figure: the standards involved: XML, XPath, XQuery, SOAP, WSDL]
ActiveXML = XML + embedded service calls (omitting syntactic details)
<resorts state="Colorado">
  <resort>
    <name> Aspen </name>
    <scond> Unisys.com/snow("Aspen") </scond>
    <hotels ID=AspHotels> …. Yahoo.com/GetHotels(<city name="Aspen"/>) </hotels>
  </resort>
  …
</resorts>
May contain calls
• to any SOAP web service
• to any ActiveXML web service
<depth unit=“meter”>1</depth>
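The idea of replacing an embedded call by its result can be sketched in a few lines of Python. This is only an illustration, not the ActiveXML syntax or engine: the <call> element, its service and arg attributes, the SERVICES table and the local stub standing in for Unisys.com/snow are all assumptions, as is reading the <depth> element above as the result of the snow call.

```python
import xml.etree.ElementTree as ET

# Hypothetical local stubs standing in for remote Web services.
# A real ActiveXML peer would issue SOAP calls instead.
SERVICES = {
    "Unisys.com/snow": lambda resort: '<depth unit="meter">1</depth>',
}

# Hypothetical call syntax: <call service="..." arg="..."/>.
DOC = """<resorts state="Colorado">
  <resort>
    <name>Aspen</name>
    <scond><call service="Unisys.com/snow" arg="Aspen"/></scond>
  </resort>
</resorts>"""


def materialize(root):
    """Replace every <call> element by the XML returned by the named service."""
    # Snapshot the element list first so the tree can be modified safely.
    for parent in list(root.iter()):
        for i, child in enumerate(list(parent)):
            if child.tag == "call":
                result = SERVICES[child.get("service")](child.get("arg"))
                new = ET.fromstring(result)
                new.tail = child.tail
                parent.remove(child)
                parent.insert(i, new)
    return root


if __name__ == "__main__":
    tree = materialize(ET.fromstring(DOC))
    print(ET.tostring(tree, encoding="unicode"))
```

Before materialization the document is intensional (it says how to get the snow depth); afterwards it is extensional (it contains the depth).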
ActiveXML peer
Each ActiveXML peer
• Repository
• Web client
• Web server
Open-source in ObjectWeb, see http://ActiveXML.net
Based on standard libraries
• SUN’s Java SDK 1.4 (XML parser, XPath processor, XSLT engine)
• Apache Tomcat 4.0 servlet engine, Apache Axis SOAP toolkit 1.0
• X-OQL query processor (soon replaced by eXist XML-db?)
[Figure: ActiveXML peers communicating via SOAP]
Distributed hash tables
locate(k)
put(k,v1): hash(k) determines the peer P_h(k) where (k,v1) is kept
get(k) retrieves v1, v2, … from P_h(k)
delete(k,v)
Management of the overlay network is complex because peers come and go
We use Pastry
We have tried others: Chord, Jxta
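The four operations can be sketched with a toy, single-process DHT. This is an assumption-laden illustration, not Pastry: the class name ToyDHT, the fixed peer list and the "hash modulo number of peers" placement are all made up for the example, and a real overlay must also handle routing and peers joining or leaving.

```python
import hashlib


class ToyDHT:
    """Toy DHT: every key is owned by exactly one peer, chosen by hashing."""

    def __init__(self, peer_names):
        self._ring = sorted(peer_names)
        # Each peer stores key -> list of values.
        self.peers = {name: {} for name in self._ring}

    def locate(self, key):
        """Return the name of the peer in charge of key."""
        digest = int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)
        return self._ring[digest % len(self._ring)]

    def put(self, key, value):
        self.peers[self.locate(key)].setdefault(key, []).append(value)

    def get(self, key):
        return list(self.peers[self.locate(key)].get(key, []))

    def delete(self, key, value):
        values = self.peers[self.locate(key)].get(key, [])
        if value in values:
            values.remove(value)


if __name__ == "__main__":
    dht = ToyDHT(["peer1", "peer2", "peer3"])
    dht.put("k", "v1")
    dht.put("k", "v2")
    print(dht.locate("k"), dht.get("k"))
```

Both values published under k end up at the same peer, so one get(k) is enough to retrieve them.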
[Figure: a DHT holding the entry k: v1, v2 after put(k,v1) and put(k,v2); get(k) returns v1, v2]
KadoP: a P2P content sharing system
KadoP data items
XML documents and web services
• XML sub-trees, views and collections of documents
• Labels, words and stemming of these words
Types
• DTD and XSD for documents, WSDL for services
Also
• ActiveXML documents and ActiveXML services
Ontologies
• Concepts, isa, etc.
Goal
Find relevant information to answer a query
May require some extensional information
May require to call some Web services
May require some elaborate query plan including service composition
Simple examples
• Find me Emacs packages
• Find me Emacs packages that were modified last week
• Find me the packages depending on Emacs in my Linux system
KadoP architecture
[Figure: a KadoP peer (publish & query). Components: Web interface, semantic layer, KadoP engine (indexing, query processing), ActiveXML engine, DHT (locate, put, get & delete), index. Layers: external, logical, physical.]
Architecture
Java/JSP application on each peer
KadoP: Distributed index
EDOS distribution system
ActiveXML: Data/metadata storage
IDiP: dissemination management
BitTorrent: efficient download
[Figure: EDOS distribution system. Publisher: EDOS API, KadoP, ActiveXML, IDiP, BitTorrent. Client: EDOS API, ActiveXML (marked optional), IDiP, BitTorrent. Replicator: EDOS API, KadoP, ActiveXML, IDiP, BitTorrent. Index, data and metadata; communication through web services.]
P2P XML Query processing
Efficient evaluation of tree-pattern queries
Many optimization techniques
We are interested here in distributed query evaluation/optimization
1) We consider XML indexing
2) Holistic twig join that is based on indexing
3) P2P indexing
4) P2P query processing
5) Optimizing P2P indexing
XML indexing: structural identifiers
Structural IDs = (prefix, postfix, level)
X ancestor of Y <=> pre(X) < pre(Y) and post(X) ≥ post(Y)
X parent of Y <=> X ancestor of Y and level(X) = level(Y) - 1
[Figure: example tree with nodes A, B, C, D, E, F, G and the text leaf “John”, each node annotated with its prefix number, postfix number and level]
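A small sketch of how such identifiers can be assigned and used. The tree shape below is hypothetical (the slide's tree is not recoverable from the transcript); only the root identifier A(1,8,0) is known to agree with the slides. The function names are also made up for the example.

```python
# Hypothetical tree, written as (label, [children]). Labels are unique here.
TREE = ("A", [
    ("B", [("D", []), ("E", [])]),
    ("C", [("F", [("John", [])]), ("G", [])]),
])


def assign_ids(tree):
    """Return {label: (pre, post, level)} using one depth-first traversal."""
    ids = {}
    counters = {"pre": 0, "post": 0}

    def visit(node, level):
        label, children = node
        counters["pre"] += 1          # numbered when the node is entered
        pre = counters["pre"]
        for child in children:
            visit(child, level + 1)
        counters["post"] += 1         # numbered when the node is left
        ids[label] = (pre, counters["post"], level)

    visit(tree, 0)
    return ids


def is_ancestor(x, y):
    """pre(X) < pre(Y) and post(X) >= post(Y), as on the slide."""
    return x[0] < y[0] and x[1] >= y[1]


def is_parent(x, y):
    """Ancestor test plus level(X) = level(Y) - 1."""
    return is_ancestor(x, y) and x[2] == y[2] - 1


if __name__ == "__main__":
    ids = assign_ids(TREE)
    print(ids["A"], is_ancestor(ids["A"], ids["John"]))
```

The point of the encoding is that ancestor and parent tests need only the two identifiers, not the document itself, which is what makes them usable in an index.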
Holistic Twig Join
Input a document and a tree pattern query
Find the bindings of the query in the document
Holistic: considering the whole and not just the parts
Twig: a small branch
Join: as in a database join
Sounds like Harry Potter?
Query evaluation over a document
Ids for A: (1,8,0)…
Ids for D
Ids for “John”
Ids for C
Ids are sorted in lexicographical order
The goal is to find “matching Ids”
[Figure: tree pattern query over the nodes A, D, C and the word “John”]
The Holistic Twig Join Algorithm
Query nodes: a, b, c
Document nodes with their identifiers, by level:
level 0: r (1,25)
level 1: a (2,8), c (9,14), a (15,17), a (18,25)
level 2: a (3,5), c (6,8), b (10,11), b (12,14), a (16,17), b (19,22), c (23,25)
level 3: c (4,5), c (7,8), c (11,11), b (13,14), c (17,17), b (20,21), c (22,22), b (24,25)
level 4: a (5,5), a (8,8), c (14,14), c (21,21), c (25,25)
The Holistic Twig Join Algorithm
[Figure: animation of the algorithm over the streams a1…a7, b1…b6 and c1…c11, with one stack per query node (Sa, Sb, Sc)]
Legend:
• Head of the stream
• Find the match for the query sub-tree determined by this node
• The ID is also present in the stack
• Stacks
Results at the end of the run: (a7, b4, c8), (a7, b5, c8), (a7, b4, c9), (a7, b6, c11)
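The stack-based evaluation can be sketched for the special case where the query is a single path a//b//c (ancestor-descendant edges). This is a simplified PathStack-style sketch, not the full holistic twig join and not KadoP's code; it assumes the identifiers are (start, end) intervals in which a node contains exactly its descendants, and that each stream is sorted by start. The function and variable names are made up.

```python
import math

# Streams of (start, end) identifiers, one per query node, sorted by start.
A_STREAM = [(2, 8), (3, 5), (5, 5), (8, 8), (15, 17), (16, 17), (18, 25)]
B_STREAM = [(10, 11), (12, 14), (13, 14), (19, 22), (20, 21), (24, 25)]
C_STREAM = [(4, 5), (6, 8), (7, 8), (9, 14), (11, 11), (14, 14), (17, 17),
            (21, 21), (22, 22), (23, 25), (25, 25)]


def path_stack(streams):
    """Match a path query; streams[0] is the root node, streams[-1] the leaf."""
    n = len(streams)
    pos = [0] * n                      # head of each stream
    stacks = [[] for _ in range(n)]    # entries: (node, top index of parent stack)
    results = []

    def head_start(i):
        return streams[i][pos[i]][0] if pos[i] < len(streams[i]) else math.inf

    def expand(level, top):
        """Yield all partial matches ending in stacks[level][0..top]."""
        for j in range(top + 1):
            node, ptr = stacks[level][j]
            if level == 0:
                yield (node,)
            else:
                for prefix in expand(level - 1, ptr):
                    yield prefix + (node,)

    while pos[-1] < len(streams[-1]):
        q = min(range(n), key=head_start)      # stream with the smallest head
        node = streams[q][pos[q]]
        for stack in stacks:                   # drop nodes that ended before node
            while stack and stack[-1][0][1] < node[0]:
                stack.pop()
        parent_top = len(stacks[q - 1]) - 1 if q > 0 else -1
        stacks[q].append((node, parent_top))
        pos[q] += 1
        if q == n - 1:                         # leaf: output matches, then pop
            results.extend(expand(q, len(stacks[q]) - 1))
            stacks[q].pop()
    return results


if __name__ == "__main__":
    for match in path_stack([A_STREAM, B_STREAM, C_STREAM]):
        print(match)
```

Each stream is read once and intermediate results are kept compactly in the stacks. If the nodes of each stream are numbered in order (a1…a7, b1…b6, c1…c11), the four matches produced on this data are the four result triples listed on the slide.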
Also: Intensional data
Example: include and references
Example: function calls in ActiveXML
Find me the packages depending on Emacs in my Linux system
package (name, author, size, signature, dependsOn(self)…)
the depending packages are intensional
Naïve: return empty answer
Brutal: return all documents with a function call
What we do: use indexing (and typing)
The implementation
Some technical issues
Common belief: this cannot work because of transfer delays
• Indeed, first experiments were a disaster
• DHT did not scale – not designed for so many entries
• Transfers of long posting lists were killing the system
Our target: make it work in some modest setting
with millions of documents
with thousands of peers
with not too volatile peers
(not Kazaa or GoogleSearch but industrial application)
Let’s make it work
Some of the early observations of MDP2P and solutions
• Replace the file-system storage of the DHT index by storage in a database (Berkeley DB)
• Extend the API of the DHT to have Append and not only Read/Write
• Extend the API of the DHT to have a streaming exchange of postings (for long postings)
  – Useful because the XML algebra works better with streams
Now KadoP scales but can be optimized
We will see here one optimization technique: DPP
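The two API extensions can be sketched as follows. This is an in-memory stand-in for one peer's index storage, written only to show the shape of the operations; the class and method names are assumptions, not the KadoP or FreePastry API, and the real system stores postings in Berkeley DB.

```python
from collections import defaultdict


class PostingStore:
    """Hypothetical per-peer posting storage with append and streaming reads."""

    def __init__(self):
        self._lists = defaultdict(list)

    def append(self, key, posting):
        """Add one posting without reading and rewriting the whole list."""
        self._lists[key].append(posting)

    def stream(self, key, chunk_size=2):
        """Yield the posting list chunk by chunk instead of in one message."""
        postings = self._lists.get(key, [])
        for i in range(0, len(postings), chunk_size):
            yield postings[i:i + chunk_size]


if __name__ == "__main__":
    store = PostingStore()
    for posting in [(1, 8), (3, 5), (9, 14), (15, 17), (18, 25)]:
        store.append("name", posting)
    for chunk in store.stream("name"):
        print(chunk)
```

With append, publishing a document costs one small write per posting. With streaming, a consumer such as a join operator can start working on the first chunk while later chunks are still in transit.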
Distributed Posting Partitioning
Long posting = bad response time
1. No long posting
2. get h(name) then parallel fetch
3. Possibility to optimize further
   • f(docId55..docId75): maybe it does not match, so no need to call f
[Figure: instead of one long posting stored at h(Name), h(Name) points to blocks f, g, h, i, organized as a distributed B-tree]
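A rough sketch of the partitioning idea, under simplifying assumptions: postings are reduced to sorted integer document IDs, the blocks live in a local list rather than on remote peers, and a flat directory of (first, last, block) entries stands in for the distributed B-tree. All names and the block size are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 3  # hypothetical; a real system would use much larger blocks


def partition(postings, block_size=BLOCK_SIZE):
    """Split a sorted posting list into blocks plus a directory of ID ranges."""
    blocks = [postings[i:i + block_size]
              for i in range(0, len(postings), block_size)]
    directory = [(block[0], block[-1], idx) for idx, block in enumerate(blocks)]
    return directory, blocks


def blocks_to_fetch(directory, lo, hi):
    """Keep only blocks whose ID range can intersect [lo, hi]."""
    return [idx for first, last, idx in directory if first <= hi and last >= lo]


def fetch(directory, blocks, lo, hi):
    """Fetch the relevant blocks in parallel and filter their postings."""
    wanted = blocks_to_fetch(directory, lo, hi)
    with ThreadPoolExecutor() as pool:
        # blocks[idx] stands in for a remote call to the peer holding the block.
        parts = pool.map(lambda idx: blocks[idx], wanted)
        return [doc for part in parts for doc in part if lo <= doc <= hi]


if __name__ == "__main__":
    directory, blocks = partition([10, 20, 30, 55, 60, 75, 90, 95])
    print(blocks_to_fetch(directory, 56, 92), fetch(directory, blocks, 56, 92))
```

The directory is what the peer in charge of h(name) would hold: it is small, and it lets the query skip any block whose range cannot match, as in the f(docId55..docId75) case.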
Performance
Main issues
Scaling: optimize query processing
• Adapting Bloom filters and other known techniques
• Ongoing in Gemo
Scaling: main tool is replication
• Issues are consistency and overhead
• Ongoing work in MDP2P/Atlas
Dynamicity: better manage peers entering/leaving the system
A KadoP application: Data management in Edos
The distribution of a large software system to the peers developing it
Mandriva Linux distribution: 10 000 packages + metadata between up to 1 000 peers
Thousands of packages (about 9000 in Mandriva)
• Package metadata in XML
• And why not: bug reports, annotations, emails, etc.
Goal: distribute & query & monitor & getPackage
Technologies: ActiveXML + KadoP + IDiP + BitTorrent
Conclusion
V1 of KadoP and EdosDistribution are running
• Open-source
Management of XML resources in P2P
• Management of semantics and web services as well
• Based on active data (ActiveXML) and DHT (FreePastry)
Novelties in KadoP
• Management of data and knowledge in P2P
• Use of intensional information
• Original optimization techniques
Future work
• ANR Platform for content management: webContent
Thank you