1 adbis06, data ring1 data ring: let us turn the net into a database! serge abiteboul, inria-futurs...

62
1 ADBIS06, Data ring 1 Data Ring: Let us turn the net into a database! Serge Abiteboul, INRIA- Futurs Neoklis Polyzotis, UC Santa-Cruz

Post on 22-Dec-2015

221 views

Category:

Documents


1 download

TRANSCRIPT

1

ADBIS06, Data ring 1

Data Ring: Let us turn the net into a database!

Serge Abiteboul, INRIA-Futurs

Neoklis Polyzotis, UC Santa-Cruz

2

Introduction

3

ADBIS06, Data ring 3

Success stories after the Internet bubble

Google: management of Web pages

Mapquest: management of maps

Amazone: book catalogue

eBay: product catalogue

Napster (emule, bearshare, etc.): music database

Flickr: picture database

Wikipedia: dictionary They are all aboutpublishing some database

4

ADBIS06, Data ring 4

A trend is towards peer-to-peer

P2P:

A large and varying number of computers cooperate to solve some particular task without any centralized authority

Goal: build an efficient, robust, scalable system based (typically) on inexpensive, unreliable computers distributed in a wide area network

Examples• seti@home: search for extraterrestrial intelligence • kazaa: obtain free music/video over the net• cabal: decryption of 512 bits RSA code • grub: P2P Web search

5

ADBIS06, Data ring 5

A trend is towards peer-to-peer

The Web is switching from centralized servers to communities and syndication

Buzzwords such as Web 2.0, infoware, annotations, interactivity

Some motivations• Social, organizational: no need to have a leader, more flexible

• Technical: performance, availability, no bottleneck

Analogy: • Software development by very structured and controlled groups of

programmers, e.g., Windows

• Open-source software produced by large communities of autonomous developers, e.g., Linux

6

ADBIS06, Data ring 6

Data ring

Information management in a P2P network

Information is heterogeneous, distributed, replicated and dynamic• Data + knowledge + services

Peers are heterogeneous, autonomous and possibly mobile• Heterogeneous storage, bandwidth, computing power • From sensors to PDA to mainframe

Variety of requirements in terms of number of peers, QoS, performance, security, etc.

We will see motivations and examples

7

ADBIS06, Data ring 7

Acknowledgement – people & projects

Xyleme: Sophie Cluet, Guy Ferran & many others

Scalable XML repository

A start-up in 2000

ActiveXML within EC DbGlobe project: Omar Benjelloun, Ioana Manolescu, Tova Milo & many others

Framework for 2P data management

KadoP within EC Edos project: Ioana Manolescu, Nicoleta Preda & many others

P2P scalable XML repository

8

ADBIS06, Data ring 8

Outline

Introduction

Data ring

XML query processing

P2P XML query processing

Logic and algebra for distributed data management

Automatic and distributed management

Other issues in brief

Conclusion

9

Data ring

10

ADBIS06, Data ring 10

Functionalities:DBMS, content management and more

Storage, persistence, replication

Indexing, caching, query, update, optimization

Schema management, access control

Fault tolerance, self tuning, monitoring

Semantic enrichment, information retrieval, uncertain data, result ranking, multi-lingual…

Resource discovery, history, provenance

Each functionality may be achieved by a peer or by the network

This may depend on the nature of the application

11

ADBIS06, Data ring 11

The information in a data ring

Data: tuples, collections, documents, relations, objects…

Services: data sources, possibly some computation (e.g. genome)

Knowledge

• Meta-data about resources: attribute/values pairs, annotations

• Ontologies to explain data and metadata

• Mapping between ontologies (for data integration)

• Views (possibly recursive)

Physical data

• Indices

• Materialized views

• Statistics for query optimization

12

ADBIS06, Data ring 12

And now, what is a peer?

A mainframe database

A file system

Web server

A PC

A PDA

A telephone

A sensor

A home appliance

A car

A manufacturing tool

A telecom equipment

A toy

Another data ring

Any connected device or software with information to share

+ a net address + names for its resources

d@p, s@p’

?

13

ADBIS06, Data ring 13

Example: Edos distribution system

A system for the management of Linux distribution

Joint work with Mandriva Software and U. Tel Aviv

Community of open-source developers: thousands

System releases: about 10 000 software packages + meta data• Package name, author; date, signature, documentation

• Collections of packages, dependencies between packages

Functionalities• Query the metadata (also query subscription)

• Retrieve packages

• Update an existing release

• Publish a new release (based on BitTorrent)

14

ADBIS06, Data ring 14

Main parameters of such applications

Number of peers

Number of documents

How volatile the peers are

The query workload

The update workload

The “network distance” between the peers

Edos: peers and documents in thousands, mostly append for updates, peers not too volatile

An extreme: Google search engine in P2P for billions of documents using millions of volatile peers

15

ADBIS06, Data ring 15

Lots of related work and related systems

Structured P2P nets: Pastry, Chord

Content delivery net: Coral, Akamai

XML repositories: Xyleme, DBMonet

Multicast systems: Avalanche, Bullet

File sharing systems: BitTorrent, Kazaa

Pub/Sub systems: Scribe, Hyper

Distributed storage systems: OceanStore, GoogleFS

Data spaces: Franklin, Halevy

16

ADBIS06, Data ring 16

Standards: The golden triangle of distributed content management on the Web

Standard for data exchange• XML, XML Schema…

• Extensible Markup Language

• Labeled ordered trees

• Foundations: tree automata

Query languages• XPATH, XQuery…

Standards for distributed computing: Web services • SOAP, WSDL, UDDI, BPEL…

• Simple Object Access Protocols

• Corba but simpler and on the Web

SOAPWSDL

XML

XqueryXpath

17

ADBIS06, Data ring 17

Information used to live in islands but it is changing

Different formats: relational, metadata, documents, text, DXF• A Web standard for data exchange, XML, is fixing it • XML captures all kinds of information over a wide spectrum • XML comes with a family of emerging standards: XML schema, XSL/T,

Xquery, domain specific schemas…

Different computers, platforms, languages, applications• A standard for Web services, SOAP, is fixing it• SOAP allows ubiquitous computing on the Internet• SOAP comes with a family of emerging standards: WSDL, UDDI

This provides a uniform access to information… …the dream for distributed data management

18

ADBIS06, Data ring 18

Why XML? Because of its information spectrum

Structured Data

Minimal structure

Meta dataHierarchy +

Books Contracts Catalogs Bank accounts

Emails Financial Reports Insurance Policies

Economical Analysis Derivatives Inventory

Political analysis Insurance Claims

Financial News Sports News Resumes

19

ADBIS06, Data ring 19

A standard for information: XML

Labeled ordered trees where leaves are text

Marriage of document and database worlds

Marriage of full text indexing and structure indexing

Is it the ultimate data model? No

Purely syntax – more semantics needed

Is it OK for now? Definitely yes (because it is a standard)

20

ADBIS06, Data ring 20

The main asset of XML (vs. HTML) = typing

Applications need typing

XML typing: DTD and XML schema

Trees

Semantics and structure are in tags and paths• product-table/product/reference• product-table/product/price

product

designation descriptionprice

reference

product-table

21

ADBIS06, Data ring 21

A standard for distributed computing: Web services

Possibility to activate a method on some remote Web server

Exchange information in XML: input and result are in XML

Ubiquitous XML distributed computing infrastructure

2 main applications• E-commerce• Access to remote data

With XML and Web services, it is possible• To get information from virtually anywhere• To provide information to virtually anywhere

22

ADBIS06, Data ring 22

A standard for XML queries: Xquery (soon a recommendation of W3C)

A “logic” for labeled tree – a declarative language

Inspired by SQL: standard for relation data

Inspired by OQL: standard for object databases• Functional as OQL

• Not as clean as OQL

Mixes structure and content• Give me the documents where the word XML appears in title

• Some full-text extension is coming

Also an update language and full-text capabilities

23

ADBIS06, Data ring 23

XML and Xquery

XML focuses on lists• Represent a very set of objects as a list

• In our context, overhead in terms of index maintenance and query processing

• Here we are mostly concerned with a “set-oriented” view of XML

Xquery is very complicated and powerful• OK for large documents on a server

• Not appropriate for huge volumes of small documents on the Web

• Here we are mostly concerned with tree pattern queries (a subset of XQuery) and full-text search

24

ADBIS06, Data ring 24

So to be honest

We take only what we want in the standard• Ignore order

• Ignore complex features of XQuery

And even that is not what we need…

25

ADBIS06, Data ring 25

Thesis

This is fine for efficient query processing of local XML

This is not sufficient for distributed data management and flexible information integration

What made the success of the relational model, i.e., of tables on a server :1. A logic for defining tables

2. An algebra for describing query plans over tables

By analogy, we need for trees in a P2P system1. A logic for defining distributed data and data services

2. An algebra for describing query plans over these

Proposal: ActiveXML and query optimization around it

26

ADBIS06, Data ring 26

What we do next?

XML processing on a server

XML processing in a P2P system

ActiveXML

Then other data ring functions

27

XML query processing

T

28

ADBIS06, Data ring 28

Xyleme – in short

1999: Xyleme research project at INRIA

2000: Creation of a spin-off

2006: About 25 people

Technology: a content warehouse built around a very efficient and scalable XML repository

Example: all articles of Le Monde in XML, financial reports in JP-Morgan

29

ADBIS06, Data ring 29

Xyleme Functionalities

Store & IndexStore & Index

Query ProcessingQuery Processing

View and Semantic View and Semantic stemming, integration, classification…stemming, integration, classification…

WebWeb

GU

I , Web

servi ces , repo

rt ing

Feed

ing

Exp

loitin

g

30

ADBIS06, Data ring 30

Xyleme Architecture

XMLstore

Index

Loader| Local | Query

Global Query Manager

Global Query Manager

Application Server Tomcat|

Soap

Corba

Name ServerUser ManagerUrl Manager

Notification Mgr

HTTP | Web Service API

ApplicationsIE/Java/C++/.Net

...

Java/C++ API

or

Or Any Platform

XMLstore

Index

Loader| Local | Query

XMLstore

Index

Loader| Local | Query

Client sideClient side

Server sideServer side

31

ADBIS06, Data ring 31

Structural identifiersand indexing

1

2

3 4

5

6

71

2

3

4

5

6

7

X ancestor of Y <=>pre(X) < pre(Y) andpost(X) > post(Y)

X parent of Y <=>X ancestor of Y andlevel(X) = level(Y) - 1Structural IDs = Prefix-Postfix-Level

A

B

D E

C

F

“John” G

0

1 1

22 2

3

LAN

Put(C;[d,p,6,6,1])

Put(“John”;[d,p,3,1,2])

hash(C)

hash(“John)

32

ADBIS06, Data ring 32

Query evaluation based on Holistic twig joins

(d1, 201, 400)

(d1, 224, 201) (d1, 228, 237)

(d1, 228, 237)

A

DC

“John”

33

P2P XML processing

34

ADBIS06, Data ring 34

Moving this to a P2P network – KadoP

Advantages: scaling Disadvantages: complexity

Performance• Optimization of parallelism• Avoid bottleneck• Replication

Availability • Replication

Cost• Avoid the cost of server• Share operational cost

Dynamicity• add/remove new data sources

Performance• Cost for complex queries

• Communication cost

Availability• Peers can leave

Consistency maintenance • Difficult to support transaction

Quality• Difficult to guarantee quality

35

ADBIS06, Data ring 35

Distributed hash tables

Typically on a WAN

Peers come and go

Standard interface • Locate(k):: log(n) messages to fing

the peer in charge of key k • Put(k,v)• Get(k): retrieves all the values for k

We tried Pastry, Chord and JXTA

We now use Pastry

Not satisfactory for the system we want to develop

example: locate, put

DHT

put(k;v2)

hash(k)

get(k)

put(k;v1)

put(k;v3)v1,v2,v3

36

ADBIS06, Data ring 36

Indexing in KadoP

Use structured ID as in Xyleme

Publish them in a DHT

Use Holistic twig join

Main issue: communications • WAN vs. LAN

Long posting lists

Optimization techniques• Use only docID [wisconsin]• Ship smallest list• Semi-join techniques • Intensional indexing

DHT

put(C;[d,p,6,6,1])

put(“John”;[d,p,3,1,2])

hash(C)

hash(“John)

DHT

hash(C)

37

ADBIS06, Data ring 37

Such P2P systems scale… sort of…

If we consider • The Internet

• Millions of peers

• A “large” number of documents per peers

• A “reasonable” query load per peer

This is placing too much load on the network• Lots of queries will result in intersecting posting lists on peers that are

remote, e.g. US and Europe

Besides parallelism, we should use replication intensively

To intersect two long posting lists, I should be most likely to find two replicas that are nearby eachother

Very interesting distributed data management problem

38

Logic and algebra for distributed data management: ActiveXML

39

ADBIS06, Data ring 39

ActiveXML

Recall the standards of distributed data management: XML, Web services, XQuery

ActiveXML = XML documents with embedded Web service calls where service calls are typically in Xquery

Web service provides:

Intensional data: not explicitly here

Dynamic: a new call may provide new data

SOAPWSDL

XML

XqueryXpath

40

ADBIS06, Data ring 40

ActiveXML = XML + embedded service calls(omitting syntactic details)

<resorts state=‘Colorado’> <resort> <name> Aspen </name> <scond> Unisys.com/snow(“Aspen”)

</scond> <hotels ID=AspHotels > …. Yahoo.com/GetHotels(<city name=“Aspen”/>) </hotels> </resort> …</resorts>

May contain callsto any SOAP web serviceto any ActiveXML web services

<depth unit=“meter”>1</depth>

41

ADBIS06, Data ring 41

Not a new idea in databasesNot a new idea on the Web

Mixing calls to data is an old idea• Procedural attributes in relational systems

• Basis of Object Databases

In HTML world • Sun’s JSP, PHP+MySQL

Call to Web services inside documents• Macromedia MX, Apache Jelly

42

ADBIS06, Data ring 42

What exactly to exchange

A parameter of a call contains some service calls

The result of a call contains some service calls

Do we evaluate these calls before transmitting the data or not

Hi Pierre, what is the phone number of Number 10 in the French soccer team?• Find his name in wc2006.com/France/team.xml

then look in fifa.PhoneBook.org

• Look for Zidane in fifa.PhoneBook.org

• 33 4 13 00 00 01

43

ADBIS06, Data ring 43

When to activate the callExplicit pull mode

• Frequency: Daily, weekly, etc.• After some event: e.g., when another service call completed• This aspect of the problem is related to active databases

Implicit pull mode : Lazy• When the data is requested • Difficulty : detect that the result of a particular request may be

affected by a particular call• This is related to deductive databases

Push mode• E.g., based on a query subscription; the web server pushes

information to the client• E.g., synchronization with an external source• This is related to stream and subscription queries

44

ADBIS06, Data ring 44

ActiveXML peer

Peer-to-peer architecture

Each ActiveXML peer • Repository: manages

ActiveXML data with embedded web service calls

• Web client: uses Web services

• Web server: provides (parameterized) queries/updates over the repository as web services

Open source system

SUN’s Java SDK 1.4 • XML parser• XPath processor, XSLT engine

Apache Tomcat 4.0 servlet engine

Apache Axis SOAP toolkit 1.0

X-OQL query processor• persistent DOM repository

JSP-based user interface• JSTL 1.0 standard tag library

see http://activexml.net

ActiveXMLpeerso

ap

45

ADBIS06, Data ring 45

ActiveXML is a logic for distributed tree management… and also an algebra

ActiveXML extended• Generic services: s@any

• Send and receive (work on multicasting)

• Stream and EoS

ActiveXML is restricted: recall we ignore node ordering

Novelties• Stream of trees vs. set of relations for the relational model

• Distribution: exchange of query plans

• Optimization & evaluation are interleaved

46

ADBIS06, Data ring 46

A simple example

paper

author title

“Ullman” “magic set”

//paper[//author contains “Ullman”][//title contains “magic set”]

Q@any

Ring

Q@any

indexQ@local

Q@any

*

d21@p2 d12@p6

Query plan: evaluate: q locally or ship it to p2,p6

47

ADBIS06, Data ring 47

A simple example - continued

Optimization of the Index computation (indexQ)

send send

get get

get get

“Ullman”

send

Author

send

Title

send

“magic set”

send

Author

“Ullman”

send

“magic set”

get get

get

Title get

send

48

ADBIS06, Data ring 48

Distributed query optimization

This provides an algebra for distributed query processing

Allows interleaving query processing and query optimization

Can interact with local optimizers

Based on XML flow and local XML flow processing

Most techniques that I know of can be captured• Horizontal decomposition, semi-join, replication, magic set…

Main issue to guide optimization is the gathering of statistics

49

Automatic and distributed management of the data ring

50

ADBIS06, Data ring 50

Context

No centralized server

No information administrator (no info manager)

Most users are non-experts• E.g., scientists

Requirements• Ease of deployment (zero-effort)

• Ease of administration (zero-effort)

• Ease of publication (epsilon-effort)

• Ease of exploitation (epsilon-effort)

• Participation in community building notably via annotations

51

ADBIS06, Data ring 51

What should be made automatic

Self-statistics from the monitoring of the Data ring• In particular, define the statistics that are needed*

Self-tuning based on the self-statistics• Choose the most appropriate organization

• Decide to install access structures: indexes, views, etc.

• Control replication of data and services

Self-healing• Recovery from errors

• E.g., replacement of a failing Web service

And automatic file management

52

ADBIS06, Data ring 52

Automatic file management

Lots of data are in file systems (not managed by a database)• Scientific data often

• Personal data

Need to also optimize access to it• Indexing, efficient filtering, replication, etc.

• Algebra for file access

• Concurrency control, versioning, change control, etc.

53

ADBIS06, Data ring 53

Any hope?

Technology exists (database self-tuning, machine learning, etc.)

But self-tuning for databases has advanced very slowly

Why can this work?

1. There is no alternative (for db, this was just a cool gadget)

2. KISS (keep it simple stupid!)

3. The power of heavy parallelism and heavy replication

Suppose 2 indexes give the best performance; you can afford to ask 20 peers to each try 5 to see what works

This is assuming lots of machine have free cycles (true) and bandwidth is generous (not always true)

54

Other issues in brief

55

ADBIS06, Data ring 55

Distributed access control

Goal: Control access to ring resources

Access to resources is based on access rights (ACL)

Who is controlling ACLs?

A node manages ACLs for a collection of distributed resources• Easy but against the spirit and possible bottleneck

The network manages access control• Anybody can get the data• The data is published with encryption and signatures; only nodes with

proper access rights can perform reads/writes – More precisely, one can check whether all updates were valid

• Some techniques exist• Verification

56

ADBIS06, Data ring 56

Surveillance

Monitoring on the ring• Updates of a data collection• Some data streams • Web service calls in a Web server• The execution of some application/query • Some RSS feeds

The output is a stream • A main cause of streams

Many applications• Business intelligence • Self-statistics and tracing• Error diagnosis• Technological watch

PubList

Monitorstreams

Outputstream

MonitorXML

collection

updates

streams

57

ADBIS06, Data ring 57

Streams are everywhere

In query processing

KadoP example

Recursive queries

In messaging, monitoring and continuous query (sub)

That is why we use an algebra over streams of trees and not simply an algebra over trees

58

Conclusion

59

ADBIS06, Data ring 59

Logic for distributed data management• Opinion: Xquery is a language for local XML management

• Language for distributed query management? ActiveXML?

Algebraic foundation of distributed query optimization• Recent proposal: ActiveXML + send/receive

KadoP and P2P (Active) XML indexing• Now being tested and working on optimization

ActiveXML is open-source – see activexml.net

KadoP soon will be – already available upon request

Edos distribution system will soon be (with Mandriva)

Already achieved

60

ADBIS06, Data ring 60

On-going work

P2P XML processing (with Ioana Manolescu and Nicoleta Preda)

Distributed query optimization (with Ioana Manolescu and Alkis Polyzotis)

Self tuning (joint work with Alkis Polyzotis)

Distributed access control (joint work with Bogdan Cautis)

Monitoring of data rings (with Bogdan Marinoiu and DistribCom group in INRIA-Rennes)

Fuzzy XML processing (with Pierre Sennelart)

Verification of ActiveXML systems and data flow (with same)

61

ADBIS06, Data ring 61

Find your own topic

Take an arbitrary problem for data or knowledge management and look at it in the P2P setting with gigabytes of data and thousands of peers

If not yet interesting, consider terabytes of data and millions of peers.

62

ADBIS06, Data ring 62

Merci

Merci