
Page 1: 1 XML warehouse – 2002 1 XML Warehousing and Xyleme S. Abiteboul INRIA and Xyleme Serge.Abiteboul@inria.fr December 2002

XML Warehousing and Xyleme

S. Abiteboul
INRIA and Xyleme
Serge.Abiteboul@inria.fr
December 2002

Page 2: Organization

• The context and motivations
• XML warehouse
• Xyleme: An XML warehouse
  Zooms on some aspects of the technology
  – Scaling
  – Mass storage of XML
  – XML query processing
  – Semantic integration
  – Web page ranking
  – Query subscription
• Xyleme: the company, in very brief

Page 3: The context

The Web and XML are dramatically changing the world of distributed information

Page 4: The Web of yesterday

• Protocol: HTTP
• Documents: HTML
• Millions of independent web sites and billions of documents
• Browsing and keyword search (full-text indexing)
• Publication of databases using forms
• Data management with the Web
  – HTML is primarily for humans
  – Data management applications on the Web
    • Based on hand-made wrappers
    • Expensive, incomplete, short-lived, not adapted to the Web's constant change
• No real support for distributed data management!

Page 5: What is changing

Information used to live in islands and a lot of its value was wasted

1. Different formats: relational, meta data, documents and text, data exchange formats…
  – A Web standard for data exchange, XML, is fixing it
  – XML can capture a wide spectrum of information
  – XML comes with a family of emerging standards: XML Schema, XSLT, XQuery, domain-specific schemas…
2. Different computers, platforms, languages, applications
  – Web services, e.g., SOAP, are fixing it
  – SOAP allows ubiquitous computing on the Internet
  – SOAP comes with a family of emerging standards: WSDL, UDDI

Page 6: What is changing

• XML and Web services provide uniform access to information, independent of platform, system, language, communication protocol and data format…
• The dream for distributed data management
• The gathering, integration, consolidation and analysis of distributed information become feasible at a much lower cost

Page 7: (1) XML covers the information spectrum

[Figure: the information spectrum. Axis labels: structured data; hierarchy + meta data; minimal structure. Examples placed along it: books, contracts, catalogs, bank accounts, emails, financial reports, insurance policies, economic analysis, derivatives, inventory, political analysis, insurance claims, financial news, sports news, resumes.]

Page 8: XML covers the information spectrum

• Very structured information such as databases
  – Most DBMS now export in XML
• Semi-structured data such as data exchange formats (ASN.1, SGML), e.g., technical documentation
• Documents
  – Meta-data: author, date, status
  – Existing structure in them: chapter, section, table of contents and index
  – Possibly tagging of elements in them (citations, lists)
  – Links to other documents
• Meta data for unstructured data such as images and sound
• Plain text

[Figure label: XML]

Page 9: XML's asset: the marriage of text and structure

Labeled ordered trees where leaves are text
• Marriage of document and database worlds
• Marriage of full-text indexing (keyword search) and structure indexing (SQL-style query)

• Is it the ultimate data model? No
• Purely syntax – more semantics needed
• Is it OK for now? Definitely yes (because it is a standard)

Page 10: XML's asset: typing

• Applications need typing and XML data can be typed if needed (DTD and XML Schema)
• Trees
• Logical granularity – neither page nor document level – but the piece of information that is needed
• Semantics and structure are in tags and paths
  – product-table/product/reference
  – product-table/product/price

[Figure: tree with root product-table, a product node below it, and reference, designation, price and description below product]

Page 11: HTML

Information System:

Ref   Name    Price
X23   Camera  359.99
R2D2  Robot   19350.00
Z25   PC      1299.99

HTML:

The <b> X23 </b> new camera replaces the <b> X22 </b>. It comes equipped with a flash (worth by itself <i>53.99 $</i>) and provides great quality for only <i>359.99 $</i>. The new robot <b>R2D2</b>…

Text + presentation: where is the data? (hard)

Page 12: XML

Information System:

Ref   Name    Price
X23   Camera  359.99
R2D2  Robot   19350.00
Z25   PC      1299.99
...

XML:

<product-table>
  <product reference="X23">
    <designation> camera </designation>
    <price unit="Dollars"> 359.99 </price>
    <description> … </description>
  </product>
  <product reference="R2D2">
    <designation> Robot </designation>
    <price unit="Dollars"> 19350 </price>
    <description> … </description>
  </product>
  ...
</product-table>

Data + Structure = Semistructured (presentation elsewhere) (easy)
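To make the "easy" claim concrete, here is a minimal sketch in Python that follows the paths product-table/product/reference and product-table/product/price. It uses only the standard library; the function name prices_by_reference and the trimmed-down CATALOG string are illustrative assumptions, not part of the original slides.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the slide's catalog (descriptions omitted for brevity).
CATALOG = """
<product-table>
  <product reference="X23">
    <designation>camera</designation>
    <price unit="Dollars">359.99</price>
  </product>
  <product reference="R2D2">
    <designation>Robot</designation>
    <price unit="Dollars">19350</price>
  </product>
</product-table>
"""

def prices_by_reference(xml_text):
    """Return {reference: price} by walking product-table/product/price."""
    root = ET.fromstring(xml_text)            # root element is <product-table>
    result = {}
    for product in root.findall("product"):   # path: product-table/product
        price = product.find("price")         # path: product-table/product/price
        result[product.get("reference")] = float(price.text)
    return result

print(prices_by_reference(CATALOG))
```

The same extraction from the HTML version would need a hand-made wrapper that guesses which bold or italic span is a reference and which is a price.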

Page 13: (2) Web services and ubiquitous distributed computing

• Possibility to activate a method on some remote web server
• Exchange information in XML: input and result are in XML
• Ubiquitous XML distributed computing infrastructure
• 2 main applications
  – E-commerce
  – Access to remote data
• With XML and Web services, it is possible
  – To get information from virtually anywhere
  – To provide information to virtually anywhere

Page 14: Accessing remote information

[Figure: an application using gene banks queries some data services that provide candidate genes (the gene banks) and uses some processing services. Label: heterogeneous formats, protocols, etc.]

Page 15: Same with web services

[Figure: the same application using gene banks now reaches the data services that provide candidate genes and the processing services through the Web. Label: uniform access to information.]

Page 16: XML and Web services

• Exchange of information
  – E-commerce, B2B, G2C
  – Cooperative work
• Information brokers
  – Web sites, portals
  – Content publication in general
• Mediation mode: get the XML pages when needed
• Warehouse mode: load them in advance

Page 17: Advantages of a warehouse approach

• Allows for support of complex query processing with high performance
• Allows for complex analysis of the data
• Allows for enriching the information
• Allows for better monitoring of information
• Allows for versioning, archiving, temporal queries if needed

• Mediator approach is preferable or compulsory in some applications
  – Supply chain
  – Comparative shopping
  – Typically for volatile information such as plane ticket prices

Page 18: XML warehouse

Page 19: Main functionalities

[Figure: warehouse architecture. Components: Feeding, Repository, Enrichment, View & Integration, Exploitation. Interface labels: Admin GUI; User GUI; Editing & Pub; Access, Reporting, Sub; API. Caption: Warehousing (data warehouse) and Analysis (OLAP).]

Page 20: Main functionalities

(1) Feeding
• Loading from the Web (Internet and Intranet)
  – Web search
  – Web crawl
  – Access Web data via forms or Web services
• Plug-ins to load from
  – File systems, document management systems
  – Databases, LDAP
  – Newsgroups, emails
  – Other applications
• Extraction and transformation
  – XSLT or XQuery mappings for XML sources
  – XML-izers to load data from other formats
• Monitoring of the feeding

Page 21: Main functionalities

(1) Feeding – continued
• User feeding
  – Document editing
  – Meta data editing
  – Using WebDAV protocol
• Publication
• By GUI or from programs (SOAP-based API)

Page 22: Main functionalities

(2) Repository
• Storage of massive volume of XML (terabytes)
• Indexing of massive volume of XML
  – By structure
  – By full-text
  – Linguistic support: stemming, synonyms, etc.
• Very efficient XML query processing
• Importance ranking
• Monitoring of the warehouse (support for subscriptions)
• Access control and security
• Versioning, archiving
• Recovery
• No full transaction mechanism

Page 23: Main functionalities

(3) Enrichment
• Global organization
  – Global schema management
• Management of collections
  – Incorporate domain ontologies and thesauri
  – Document classification
  – Cleaning by filtering out documents from collections, etc.
• Document enrichment
  – Concept extraction and tagging
  – Cleaning inside the document
  – Summarization, etc.
• Relationships between documents
  – Tables of contents
  – Tables of index
  – Cross referencing, etc.

Page 24: Main functionalities

(4) View and integration
• View management
  – Document restructuring/mapping
  – Schema to schema mapping
• Semantic integration
  – Manual for complex ones and (semi-) automatic for simple ones
  – Tools to analyze a set of schemas
  – Tools to integrate them
  – Processing for queries on integration view
• Management of virtual data in a mediator style

Page 25: Functionalities

(5) Exploitation
• Access to the warehouse
  – Browsing
  – Querying by keywords, XPath or XQuery
  – Temporal queries
• Query subscription
• Reporting
  – Generation of complex reports with pointers to documents, counts, abstracts…
  – Organized by collections, content, domains…
• By GUI or from programs (Web service-based API)

Page 26: Admin: Specify the lifecycle of information in the warehouse starting from its acquisition

• Specify with parameters (in red): documents to process
• Add from a toolbox, some processing to apply (in pink)
• Specify when processing should be applied (in green)

[Figure: a processing chain. Loading from /u/news/* (start now); Transformation by some XML-izer X (flow); Storage in Collection Z (flow); Classification (off flow); Concept Tagging (off flow); Indexing (flow); Monitoring of Y (flow).]

Page 27: Specifying the enrichment

• What processing should be performed
  – Applications that come with the system
  – Arbitrary processing provided as Web services
• Interface of services
  – XML input: the documents or collection of documents in the warehouse to be processed
  – XML output: the result
• Where to plug the result
  – Where to store the new documents (collections, names)
  – Where to put enrichments in existing documents
• When to start the processing
  – At the time the document is loaded
  – At some later time, assuming some information has already been gathered (dependencies)

Page 28: User: queries and reporting

Clause     Purpose
FROM       Choose the collections of interest
WHERE      Choose the criteria of selection
SELECT     Choose what to extract as a result
PREFER     Quantity of results; preference ranking and possible relaxation
ORGANIZE   Classify/group results for presentation and drilling
STYLE      Choose presentation style

Page 29: Example

From collections MuséeRodin, WebMuseum, LACMA
Where Art_Item/artist [Name="Rodin"]
Select Name, Owner, Annotations
Prefer
  1. Rodin in title page
  2. Owner is public or owner is in France
  – Get first 20
Organize as
  1. Art_Item/material sculpture, painting, others
  2. Owner
Present as …

Page 30: Xyleme: An XML warehouse

Zooms on some aspects of the technology

Page 31: Xyleme: a dynamic XML warehouse

• Scaling
• Feeder
  – E.g., loading millions of Web documents per day with a single PC, and scaling up with more machines
• Repository
  – E.g., storing and indexing of terabytes of XML (and other formats, e.g., PDF)
• Enrichment
  – E.g., tools (together with partner) for classification and concept extraction
• View and semantic integration
  – E.g., a suite of tools for XML integration
• Exploitation
  – E.g., access via SOAP and graphic interfaces

Page 32: 1. An architecture to scale

Page 33: The scaling

• Size of data: billions of XML documents
• Size of data and index: terabytes
• Number of customers
  – thousands of simultaneous queries
  – millions of subscriptions

An architecture based on distribution

Page 34: Architecture

• Cluster of PCs
• Written in C++, runs on Linux (also Solaris)
• Communications
  – local: Corba (Orbacus)
  – external: HTTP, SOAP
• Distribution between autonomous machines

Page 35: Functional architecture

[Figure: functional modules. User Interface and Xyleme Interface; Query Processor; Semantic Module; Change Control; Repository and Index Manager; Loader; Acquisition & Crawler; Web Interface, connected to the Internet.]

Page 36: Architecture and scaling

[Figure: machines connected by Ethernet, with several instances of each role. Index machines; Repository machines; Loader | Query machines; Change Control and Semantic Integration machines; Acquisition and Maintenance machines, connected to the Internet.]

Page 37: 2. Data Acquisition and Maintenance of Web pages (internet or intranet)

Page 38: Crawl the Web

• Discover HTML/XML pages on the web (intranet or internet)
• Parse/load pages and follow links
• Manage metadata for the known pages
• Do this under bounded resources
  – Network bandwidth
  – Memory and disk resources
• Tested on the Internet in October 2001
  – Millions of pages crawled per day on each crawler
  – Up to 10 crawlers and close to 1 billion HTML/XML pages discovered in a couple of months

Page 39: Page Scheduling Optimization

• Optimization problem
  – Decide which page to crawl or refresh next to optimize the quality of the warehouse
• Criteria:
  – Read important pages more often
    • Based on customer's preferences
    • Page importance can also be used to order query results
  – Don't read a page that is probably up-to-date
    • Uses an estimate of the change frequency for each page
• Advantages
  – Have a fresh view of useful portions of information

Page 40: Page scheduling

• Determine which page to read next
  – minimize a particular cost function under some constraint (bandwidth of crawlers)
• The penalty for a page takes into account:
  – importance of the page (to be defined next)
  – customer needs (obtained via pub/sub)
  – staleness of the data
    • penalty for being out of date
    • penalty for aging
• The page scheduler fully controls the crawling
  – vs. random crawling in classic search engines
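The slides do not give the actual cost function, so the following Python sketch is only one plausible shape for it, under stated assumptions: staleness is modeled as the probability that a page has changed since its last read, assuming changes arrive as a Poisson process with an estimated rate; the priority is the product of importance, a customer weight and that probability; and the bandwidth constraint is reduced to a fixed number of pages per round. The names refresh_priority, schedule and the PAGES data are hypothetical.

```python
import math

def refresh_priority(importance, customer_weight, change_rate, elapsed):
    """Assumed penalty for NOT reading a page now (higher = read sooner)."""
    # Probability that the page changed at least once in `elapsed` time
    # units, under the Poisson-change assumption.
    p_stale = 1.0 - math.exp(-change_rate * elapsed)
    return importance * customer_weight * p_stale

def schedule(pages, budget):
    """Pick the `budget` pages with the highest priority for the next round."""
    ranked = sorted(
        pages,
        key=lambda p: refresh_priority(
            p["importance"], p["customer_weight"], p["change_rate"], p["elapsed"]
        ),
        reverse=True,
    )
    return [p["url"] for p in ranked[:budget]]

# Illustrative data: "a" is important but was just read, so it can wait.
PAGES = [
    {"url": "a", "importance": 0.5, "customer_weight": 1.0, "change_rate": 1.0, "elapsed": 0.0},
    {"url": "b", "importance": 0.1, "customer_weight": 1.0, "change_rate": 1.0, "elapsed": 10.0},
    {"url": "c", "importance": 0.4, "customer_weight": 1.0, "change_rate": 0.5, "elapsed": 2.0},
]
print(schedule(PAGES, 2))
```

A greedy top-k choice like this ignores per-page download cost; a real scheduler would optimize the total penalty under the bandwidth budget.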

Page 41: Page Importance

• Based on customer's criteria and on the link structure of the web
• Intuition: a page is important if many important pages reference it
• Fixpoint definition: importance vector Imp
  – Proposed by IBM; used by search engines such as Google
  – Link matrix: $M(i,j) = 1$ if page $i$ refers to page $j$, and 0 otherwise
  – Outdegree of page $i$: $\mathrm{out}(i)$
  – $\mathrm{Imp}_0(k) = 1/N$ (initialization)
  – $\mathrm{Imp}_m(k) = \sum_i \left[ M(i,k) \cdot \mathrm{Imp}_{m-1}(i) / \mathrm{out}(i) \right]$ (iteration)
  – Imp is the limit
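The fixpoint iteration above can be sketched in a few lines of Python. This is the plain off-line iteration from the slide, not Xyleme's on-line algorithm; the function name page_importance, the adjacency-set input format and the fixed iteration count are assumptions made for the example. Pages without outgoing links simply lose their importance here, since the slide's formula does not say how to treat them.

```python
def page_importance(links, iterations=100):
    """links[i] is the set of pages that page i refers to (M(i,k) = 1 for k in links[i])."""
    n = len(links)
    imp = [1.0 / n] * n                      # Imp_0(k) = 1/N
    for _ in range(iterations):
        nxt = [0.0] * n
        for i, targets in enumerate(links):
            if not targets:                  # out(i) = 0: nothing to distribute
                continue
            share = imp[i] / len(targets)    # Imp_{m-1}(i) / out(i)
            for k in targets:
                nxt[k] += share              # summed over all i with M(i,k) = 1
        imp = nxt
    return imp

# Three pages: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}
print(page_importance([{1, 2}, {2}, {0}]))
```

On this small graph the iteration settles near 0.4, 0.2 and 0.4: page 1 is referenced only by page 0, which splits its importance between two links.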

Page 42: Page Importance

• Novel technology developed by Xyleme
• Patent pending
• On-line evaluation of page importance
• Uses far fewer resources
• Faster reaction to changes on the web

Page 43: 2. XML Repository

Page 44: Storing XML

• Document systems
  – Good for keyword search
  – No or inefficient support for structure search
• Relational store (e.g., Oracle 8i)
  – Well adapted for some applications
  – Very typed data and tables: efficient
  – Otherwise: too many joins and inefficient
• Object database store (e.g., eXcelon) and native XML databases (e.g., Tamino)
  – Same issues
• Xyleme XML native storage

Page 45: Repository

• Goal
  – minimize I/O for direct access and scanning
  – efficient direct accesses both with full-text indexing and structure indexing
  – good compaction but not at the cost of access
• Efficient storage of trees
  – use fixed length storage pages
  – variable length records inside a page
• Main issue: tree balancing

Page 46: Tree Balancing

[Figure: a document tree split across Record 1, Record 2 and Record 3]

Page 47: Tree Balancing

Large collections may use several records

Page 48: 3. Semantic Data Integration

Page 49: Classification

• Based on word occurrences in document and statistical resources
  – Classification by semantic domain
  – Classification by language
• Use the XXX classifier

Page 50: Semantic Integration

• Web heterogeneity
• Many possible types for data in a particular domain, many DTDs
• Semantic integration
  – one abstract DTD for the domain
  – gives the illusion that the system maintains a homogeneous database for this domain

1 domain = 1 abstract DTD

Page 51: Views

• Choose an abstract DTD for each domain
  – For each concrete DTD in a domain, find how it relates to the abstract DTD using linguistic tools such as WordNet
  – Provide relationships between paths in the concrete and abstract DTD
  – Possibly automatic, manual or hybrid
• With manual mapping, a domain expert may specify much more complex views
• Query processing: process queries on the abstract DTD
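As a rough illustration of the automatic case, the Python sketch below relates concrete paths to abstract paths by comparing their last tags through a synonym table. The tiny SYNONYMS dictionary is a hypothetical stand-in for a linguistic resource such as WordNet, and matching on the leaf tag alone is an assumption made for brevity; the slides do not describe the actual matching algorithm.

```python
# Hypothetical synonym table: maps a tag to a canonical term.
SYNONYMS = {"cost": "price", "price": "price", "title": "name", "name": "name"}

def leaf_tag(path):
    """Last step of a path such as 'd2/product/cost' or 'd1//camera/price'."""
    return path.split("/")[-1]

def candidate_mappings(concrete_path, abstract_paths, synonyms=SYNONYMS):
    """Return the abstract paths whose leaf tag means the same as the concrete leaf."""
    canonical = synonyms.get(leaf_tag(concrete_path), leaf_tag(concrete_path))
    return [
        a for a in abstract_paths
        if synonyms.get(leaf_tag(a), leaf_tag(a)) == canonical
    ]

ABSTRACT = ["catalogue/product/name", "catalogue/product/price"]
print(candidate_mappings("d2/product/cost", ABSTRACT))
```

Candidates produced this way would still need review by a domain expert, which matches the hybrid mode mentioned on the slide.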

Page 52: 4. Query Processing

Page 53: Query Language

• Today: A mix of OQL and XQL
• Tomorrow: the future W3C standard
• Example

select product/name, product/price
from doc in catalogue,
     product in doc/product
where product//components contains "flash"
  and product/description contains "camera"
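To show what the example query asks for, here is a hedged Python equivalent over in-memory documents. It is not how Xyleme evaluates queries (that goes through indexes and distributed plans); it only spells out the semantics, under the assumptions that "contains" means a case-insensitive substring match on the element's text content and that "//" means any descendant. The sample document and the names run_query and text_of are illustrative.

```python
import xml.etree.ElementTree as ET

def text_of(elem):
    """All text below an element, lower-cased."""
    return "".join(elem.itertext()).lower()

def run_query(catalogue):
    """select product/name, product/price ... for a list of parsed documents."""
    results = []
    for doc in catalogue:                                  # from doc in catalogue
        for product in doc.findall("product"):             # product in doc/product
            # product//components contains "flash" (any descendant)
            has_flash = any("flash" in text_of(c) for c in product.iter("components"))
            # product/description contains "camera" (direct child)
            desc = product.find("description")
            has_camera = desc is not None and "camera" in text_of(desc)
            if has_flash and has_camera:
                results.append((product.findtext("name"), product.findtext("price")))
    return results

SAMPLE = ET.fromstring(
    "<catalogue>"
    "<product><name>X23</name><price>359.99</price>"
    "<description>A new camera</description>"
    "<details><components>lens, flash</components></details></product>"
    "<product><name>Z25</name><price>1299.99</price>"
    "<description>A PC</description></product>"
    "</catalogue>"
)
print(run_query([SAMPLE]))
```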

Page 54: Data Distribution

• Cluster of documents = physical collection of documents (semantic domain)

Distribution
• Storage machine
  – in charge of a cluster of documents
• Index machine
  – index for a cluster

Page 55: Step 0: Indexing

• Standard inverted index
  – word → documents that contain this word
• Xyleme index
  – word → elements that contain this word (document + element identifier)
• Goal: more work can be performed without accessing data
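The difference between the two index styles can be sketched as follows in Python. Each posting is a (document id, element id) pair instead of a bare document id. For simplicity the element id here is just the element's position in document order, and only the text directly inside an element is indexed; both choices, and the name build_element_index, are assumptions for the example rather than Xyleme's actual layout.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_element_index(docs):
    """docs: {doc_id: xml_text}. Returns {word: {(doc_id, element_id), ...}}."""
    index = defaultdict(set)
    for doc_id, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        # root.iter() walks the tree in document order; root gets id 0.
        for elem_id, elem in enumerate(root.iter()):
            for word in (elem.text or "").lower().split():
                index[word].add((doc_id, elem_id))
    return index

DOCS = {
    "d1": "<product><name>camera</name>"
          "<description>camera with flash</description></product>",
}
index = build_element_index(DOCS)
print(sorted(index["camera"]))
```

With element-level postings, a condition such as "description contains camera" can be narrowed down from the index alone, before any document is fetched from the repository.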

Page 56: Step 1: Localization

• Query on an abstract DTD
• Localization of machines that host concrete DTDs that will participate in the query

[Figure: a global query on the abstract DTD becomes a union of queries on local machines (local queries). Example: catalogue/product/price is relevant for machine 56 and machine 45.]

Page 57: Step 2: Optimization

• Algebraic rewriting
• Linear search strategy based on simple heuristics
  – use in-memory indexes
  – minimize communication
• Optimization of the global plan
• Optimization of the local plans

Page 58: Step 3: Execution

A plan usually consists of:
1. parallel translation from abstract queries to concrete patterns on the relevant index machines
2. parallel index scans to identify the relevant elements for a concrete pattern
3. parallel construction of resulting elements
4. pipeline evaluation (i.e., no intermediate data structure)

Note: step 2 requires smart indexes

Page 59: Abstract2Concrete

For each concrete pattern, the local plan is optimized dynamically

for catalogue/product/price
  scan relevant concrete patterns: d1//camera/price, d2/product/cost, d3/piano/price, ...
  for each concrete pattern
    scan the element ids: &234, &177
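Localization and the abstract-to-concrete translation can be pictured as a lookup followed by a grouping per machine. The Python sketch below assumes a precomputed mapping table; which concrete pattern lives on which machine is invented for the example (the slides only say that machines 45 and 56 are relevant), and the names MAPPINGS and localize are hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping table: abstract path -> [(machine, concrete pattern), ...]
MAPPINGS = {
    "catalogue/product/price": [
        ("machine45", "d1//camera/price"),
        ("machine56", "d2/product/cost"),
        ("machine56", "d3/piano/price"),
    ],
}

def localize(abstract_path, mappings=MAPPINGS):
    """Turn one abstract path into per-machine lists of concrete patterns."""
    plan = defaultdict(list)
    for machine, pattern in mappings.get(abstract_path, []):
        plan[machine].append(pattern)      # one local query per machine
    return dict(plan)

print(localize("catalogue/product/price"))
```

The global answer is then the union of what each machine returns for its own patterns, which is why only the relevant machines need to be contacted.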

Page 60: Identifiers

• Essential for query processing
• Identifier = (preorder rank, postorder rank)
  – X ancestor of Y <=> pre(X) < pre(Y) and post(X) > post(Y)
  – E.g., 2 < 5 and 4 > 2 => (2,4) is an ancestor of (5,2)

[Figure: a tree with nodes A to G and text leaves, each node annotated with its preorder and postorder ranks (1 to 7)]
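The ancestor test on (preorder, postorder) identifiers can be checked with a short Python sketch. The tree used below is a made-up example, not the one in the slide's figure, and the names number_tree and is_ancestor are illustrative.

```python
def number_tree(children, root):
    """Assign (preorder rank, postorder rank) to every node, ranks starting at 1.

    children: {node: [child, ...]} with children listed in document order.
    """
    ids = {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        counters["pre"] += 1
        pre = counters["pre"]              # rank when the node is first reached
        for child in children.get(node, []):
            visit(child)
        counters["post"] += 1
        ids[node] = (pre, counters["post"])  # rank when the node is left

    visit(root)
    return ids

def is_ancestor(x, y):
    """X ancestor of Y <=> pre(X) < pre(Y) and post(X) > post(Y)."""
    return x[0] < y[0] and x[1] > y[1]

# Made-up tree: A has children B and C; B has D and E; C has F.
TREE = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"]}
ids = number_tree(TREE, "A")
print(ids)
print(is_ancestor(ids["B"], ids["E"]), is_ancestor(ids["B"], ids["F"]))
```

Because the test needs only the two identifiers, structural joins such as product//components can be decided from index entries without reading the documents.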

Page 61: 5. Change Control

Page 62: Change management

Users are often interested in changes to the web

• Change monitoring
  – query subscription
• Soon to come: version management
  – representation and storage of changes

Page 63: Query Subscription

• Users subscribe to certain events such as
  – Update of a particular page, a page in a given site
  – Discovery of a new page containing some specific words
  – Insertion of a particular element in some pages (new products in a catalog)
  – Detection of illegal copies of selected documents
• Users may request to be notified
  – Immediately at the time the event is detected
  – Regularly, e.g., weekly
  – After a certain number of event detections

Page 64: Examples

subscription myPariscope
% what are the new movie entries in Pariscope site
monitoring newMovies
  select URL
  where URL extends www.pariscope.fr/movies/*
  and new(self)
% manage the changes in the movies showing in Paris
continuous delta Showing
  select ... from ... where
  when daily
notify daily % send me a daily report

Page 65: Atomic Events

[Figure: loading of millions of pages/day. Each loaded document d passes through the HTML parser, XML loader and metadata manager, which detect atomic events and pass them (d/46, then d/46,67) to complex event detection.]

• atomic event 46: URL matches pattern www.xyz.com/*
• atomic event 67: XML document contains the tag soccer

Page 66: Complex Events

[Figure: the HTML parser and XML loader feed complex event detection]

• complex event 12: 67 & 46 (XML document contains the tag soccer and URL matches pattern www.xyz.com/*)
• Several millions of pages crawled per day
• Hundreds of millions of alerts raised
• Millions of subscriptions
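A complex event is a conjunction of atomic events, so detection amounts to finding every registered conjunction that is fully covered by the atomic events raised for a document. The Python sketch below does this naively by testing each complex event in turn; it only illustrates the semantics and is not the scalable algorithm the slides refer to, which is not described here. Complex event 13 and atomic event 99 are invented for the example.

```python
# Registered complex events: id -> set of atomic event ids that must all hold.
COMPLEX_EVENTS = {
    12: {46, 67},   # from the slide: tag soccer and URL matches www.xyz.com/*
    13: {46, 99},   # hypothetical second subscription
}

def detect_complex_events(atomic_events, complex_events=COMPLEX_EVENTS):
    """Return the ids of all complex events satisfied by one document's atomic events."""
    raised = set(atomic_events)
    return sorted(
        cid for cid, required in complex_events.items()
        if required <= raised          # every required atomic event was raised
    )

print(detect_complex_events([46, 67]))
```

With millions of subscriptions, scanning every conjunction per document would be too slow, which is why a dedicated detection structure is needed in practice.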

Page 67: Notification Processing

• Very efficient/scalable algorithm for complex event detection
• Notifications by
  – Email
  – Web posting
  – Web services in SOAP

[Figure: complex event detection sends alerts to the notification processor, which sends notifications. Label: millions of notifications/day.]

Page 68: Xyleme in short

• Spin-off of INRIA (National Research Institute)
  – Technology developed in a research project of 60 man-years
• Creation of Xyleme SA in September 2000
• Now about 25 persons: 13 R&D, 4 services, 10 marketing, sales & admin.
• Customers include: press agency (AFP), newspaper groups (Moniteur, Le Monde), national library (BNF)
• First round of capital in 2000 (SGAM & Viventures)
• Second round in 2002 (Deutsche Bank)

Page 69: Thank you

(*) If you want to know more about Xyleme: http://www.xyleme.com