XML warehouse – 2002: XML Warehousing and Xyleme. S. Abiteboul, INRIA and Xyleme...
Organization
• The context and motivations
• XML warehouse
• Xyleme: An XML warehouse
  Zooms on some aspects of the technology
  – Scaling
  – Mass storage of XML
  – XML query processing
  – Semantic integration
  – Web page ranking
  – Query subscription
• Xyleme: the company, in very brief
The context
The Web and XML are dramatically changing the world of distributed information
The Web of yesterday
• Protocol: HTTP
• Documents: HTML
• Millions of independent web sites and billions of documents
• Browsing and keyword search (full-text indexing)
• Publication of databases using forms
• Data management with the Web
  – HTML is primarily for humans
  – Data management applications on the Web
    • Based on hand-made wrappers
    • Expensive, incomplete, short-lived, not adapted to the Web's constant change
• No real support for distributed data management!
What is changing
Information used to live in islands and a lot of its value was wasted
1. Different formats: relational, meta data, documents and text, data exchange formats…
  – A Web standard for data exchange, XML, is fixing it
  – XML can capture all kinds of information over a wide spectrum
  – XML comes with a family of emerging standards: XML Schema, XSLT, XQuery, domain-specific schemas…
2. Different computers, platforms, languages, applications
  – Web services, e.g., SOAP, are fixing it
  – SOAP allows ubiquitous computing on the Internet
  – SOAP comes with a family of emerging standards: WSDL, UDDI
What is changing
• XML and Web services provide uniform access to information, independent of platform, system, language, communication protocol and data format…
• The dream for distributed data management
• The gathering, integration, consolidation, analysis of distributed information become feasible at a much lower cost
(1) XML covers the information spectrum
[Figure: kinds of information placed along a spectrum. Axis labels: Structured Data, Hierarchy + Meta data, Minimal structure. Examples shown: books, contracts, catalogs, bank accounts, emails, financial reports, insurance policies, economical analysis, derivatives, inventory, political analysis, insurance claims, financial news, sports news, resumes.]
XML covers the information spectrum
• Very structured information such as databases
  – Most DBMS now export in XML
• Semi-structured data such as data exchange formats (ASN.1, SGML), e.g., technical documentation
• Documents
  – Meta-data: author, date, status
  – Existing structure in them: chapter, section, table of contents and index
  – Possibly tagging of elements in them (citations, lists)
  – Links to other documents
• Meta data for unstructured data such as images and sound
• Plain text
XML’s asset: the marriage of text and structure
Labeled ordered trees where leaves are text
• Marriage of document and database worlds
• Marriage of full-text indexing (keyword search) and structure indexing (SQL-style query)
• Is it the ultimate data model? No
• Purely syntax – more semantics needed
• Is it OK for now? Definitely yes (because it is a standard)
XML’s asset: typing
• Applications need typing and XML data can be typed if needed (DTD and XML Schema)
• Trees
• Logical granularity – neither page nor document level – but the piece of information that is needed
• Semantics and structure are in tags and paths
  – product-table/product/reference
  – product-table/product/price
[Figure: tree with root product-table and a product node below it, with nodes reference, designation, price and description]
HTML
Information System:
Ref   Name    Price
X23   Camera  359.99
R2D2  Robot   19350.00
Z25   PC      1299.99
HTML:
The <b> X23 </b> new camera replaces the <b> X22 </b>. It comes equipped with a flash (worth by itself <i>53.99 $</i>) and provides great quality for only <i>359.99 $</i>. The new robot <b>R2D2</b>…
Text + presentation – Where is the data? (hard)
XML
Information System:
Ref   Name    Price
X23   Camera  359.99
R2D2  Robot   19350.00
Z25   PC      1299.99
...
XML:
<product-table>
  <product reference="X23">
    <designation> camera </designation>
    <price unit="Dollars"> 359.99 </price>
    <description> … </description>
  </product>
  <product reference="R2D2">
    <designation> Robot </designation>
    <price unit="Dollars"> 19350 </price>
    <description> … </description>
  </product>
  ...
</product-table>
Data + Structure = Semistructured (presentation elsewhere) (easy)
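To make the "easy" case concrete, here is a minimal sketch of reading the product table through its paths, using Python's standard `xml.etree.ElementTree`. The function name `prices_by_reference` and the trimmed-down catalog string are illustrative assumptions, not part of the original slides.

```python
import xml.etree.ElementTree as ET

# Trimmed-down version of the slide's catalog (descriptions omitted).
CATALOG = """
<product-table>
  <product reference="X23">
    <designation>camera</designation>
    <price unit="Dollars">359.99</price>
  </product>
  <product reference="R2D2">
    <designation>Robot</designation>
    <price unit="Dollars">19350</price>
  </product>
</product-table>
"""

def prices_by_reference(xml_text):
    """Follow product-table/product/price and return {reference: (price, unit)}."""
    root = ET.fromstring(xml_text)          # root element is <product-table>
    result = {}
    for product in root.findall("product"):  # path step: product
        price = product.find("price")        # path step: price
        result[product.get("reference")] = (float(price.text), price.get("unit"))
    return result

print(prices_by_reference(CATALOG))
```

The same extraction from the HTML version would need a hand-made wrapper that guesses which bold or italic fragment is a reference and which is a price.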
(2) Web services and ubiquitous distributed computing
• Possibility to activate a method on some remote web server
• Exchange information in XML: input and result are in XML
• Ubiquitous XML distributed computing infrastructure
• 2 main applications
  – E-commerce
  – Access to remote data
• With XML and Web services, it is possible
  – To get information from virtually anywhere
  – To provide information to virtually anywhere
Accessing remote information
[Figure: an application using gene banks queries some data services that provide candidate genes and uses some processing services, facing heterogeneous formats, protocols, etc.]
Same with web services
[Figure: the application using gene banks queries some data services that provide candidate genes and uses some processing services through the Web, with uniform access to information]
XML and Web services
• Exchange of information
  – E-commerce, B2B, G2C
  – Cooperative work
• Information brokers
  – Web sites, portals
  – Content publication in general
• Mediation mode: get the XML pages when needed
• Warehouse mode: load them in advance
Advantages of a warehouse approach
• Allows for support of complex query processing with high performance
• Allows for complex analysis of the data
• Allows for enriching the information
• Allows for better monitoring of information
• Allows for versioning, archiving, temporal queries if needed
• Mediator approach is preferable or compulsory in some applications
  – Supply chain
  – Comparative shopping
  – Typically for volatile information such as plane ticket price
XML warehouse
Main functionalities
[Figure: the main modules of the warehouse. Labels: Feeding, Enrichment, Exploitation, Repository, View & Integration, Editing & Pub, Access, Reporting, Sub, Admin GUI, User GUI, API. Caption: Warehousing (data warehouse), Analysis (OLAP).]
Main functionalities
(1) Feeding
• Loading from the Web (Internet and Intranet)
  – Web search
  – Web crawl
  – Access Web data via forms or Web services
• Plug-ins to load from
  – File systems, document management systems
  – Databases, LDAP
  – Newsgroups, emails
  – Other applications
• Extraction and transformation
  – XSLT or XQuery mappings for XML sources
  – XML-izers to load data from other formats
• Monitoring of the feeding
Main functionalities
(1) Feeding – continued
• User feeding
  – Document editing
  – Meta data editing
  – Using WebDAV protocol
• Publication
• By GUI or from programs (SOAP-based API)
Main functionalities
(2) Repository
• Storage of massive volume of XML (terabytes)
• Indexing of massive volume of XML
  – By structure
  – By full-text
  – Linguistic support: stemming, synonyms, etc.
• Very efficient XML query processing
• Importance ranking
• Monitoring of the warehouse (support for subscriptions)
• Access control and security
• Versioning, archiving
• Recovery
• No full transaction mechanism
Main functionalities
(3) Enrichment
• Global organization
  – Global schema management
• Management of collections
  – Incorporate domain ontologies and thesauri
  – Document classification
  – Cleaning by filtering out documents from collections, etc.
• Document enrichment
  – Concept extraction and tagging
  – Cleaning inside the document
  – Summarization, etc.
• Relationships between documents
  – Tables of contents
  – Indexes
  – Cross referencing, etc.
Main functionalities
(4) View and integration
• View management
  – Document restructuring/mapping
  – Schema to schema mapping
• Semantic integration
  – Manual for complex ones and (semi-)automatic for simple ones
  – Tools to analyze a set of schemas
  – Tools to integrate them
  – Processing for queries on integration view
• Management of virtual data in a mediator style
Functionalities
(5) Exploitation
• Access to the warehouse
  – Browsing
  – Querying by keywords, XPath or XQuery
  – Temporal queries
• Query subscription
• Reporting
  – Generation of complex reports with pointers to documents, counts, abstracts…
  – Organized by collections, content, domains…
• By GUI or from programs (Web service-based API)
Admin: Specify the lifecycle of information in the warehouse starting from its acquisition
• Specify with parameters (in red): documents to process
• Add from a toolbox, some processing to apply (in pink)
• Specify when processing should be applied (in green)
[Figure: example lifecycle. Steps shown: Loading from /u/news/* (start now); Transformation by some XML-izer X (flow); Storage in Collection Z (flow); Classification (off flow); Concept Tagging (off flow); Indexing (flow); Monitoring of Y (flow).]
Specifying the enrichment
• What processing should be performed
  – Applications that come with the system
  – Arbitrary processing provided as Web services
• Interface of services
  – XML input: the documents or collection of documents in the warehouse to be processed
  – XML output: the result
• Where to plug the result
  – Where to store the new documents (collections, names)
  – Where to put enrichments in existing documents
• When to start the processing
  – At the time the document is loaded
  – At some later time, assuming some information has already been gathered (dependencies)
User: queries and reporting
• FROM clause: choose the collections of interest
• WHERE clause: choose the criteria of selection
• SELECT clause: choose what to extract as a result
• PREFER clause: quantity of results, preference ranking and possible relaxation
• ORGANIZE clause: classify/group results for presentation and drilling
• STYLE clause: choose presentation style
Example
From collections MuséeRodin, WebMuseum, LACMA
Where Art_Item/artist [Name="Rodin"]
Select Name, Owner, Annotations
Prefer
  1. Rodin in title page
  2. Owner is public or owner is in France
  – Get first 20
Organize as
  1. Art_Item/material sculpture, painting, others
  2. Owner
Present as …
Xyleme: An XML warehouse
Zooms on some aspects of the technology
Xyleme: a dynamic XML warehouse
• Scaling
• Feeder
  – E.g., loading millions of Web documents per day with a single PC, and scaling up with more machines
• Repository
  – E.g., storing and indexing terabytes of XML (and other formats, e.g., PDF)
• Enrichment
  – E.g., tools (together with partner) for classification and concept extraction
• View and semantic integration
  – E.g., a suite of tools for XML integration
• Exploitation
  – E.g., access via SOAP and graphic interfaces
1. An architecture to scale
The scaling
• Size of data: billions of XML documents
• Size of data and index: terabytes
• Number of customers
  – thousands of simultaneous queries
  – millions of subscriptions
An architecture based on distribution
Architecture
• Cluster of PCs
• Written in C++; runs on Linux (also Solaris)
• Communications
  – local: Corba (Orbacus)
  – external: HTTP, SOAP
• Distribution between autonomous machines
Functional architecture
[Figure: functional architecture. Components: User Interface, Xyleme Interface, Query Processor, Semantic Module, Change Control, Repository and Index Manager, Loader, Acquisition & Crawler, Web Interface, connected to the Internet.]
Architecture and scaling
[Figure: physical architecture. Replicated machines for Acquisition and Maintenance, Loader | Query, Change Control and Semantic Integration, Index, and Repository, linked by Ethernet and connected to the Internet.]
2. Data Acquisition and Maintenance of Web pages (internet or intranet)
Crawling the Web
• Discover HTML/XML pages on the web (intranet or internet)
• Parse/load pages and follow links
• Manage metadata for the known pages
• Do this under bounded resources
  – Network bandwidth
  – Memory and disk resources
• Tested on the Internet in October 2001
  – Millions of pages crawled per day on each crawler
  – Up to 10 crawlers and close to 1 billion HTML/XML pages discovered in a couple of months
Page Scheduling Optimization
• Optimization problem
  – Decide which page to crawl or refresh next to optimize the quality of the warehouse
• Criteria:
  – Read important pages more often
    • Based on customer’s preferences
    • Page importance can also be used to order query results
  – Don’t read a page that is probably up-to-date
    • Uses an estimate of the change frequency for each page
• Advantages
  – Have a fresh view of useful portions of information
Page scheduling
• Determine which page to read next
  – minimize a particular cost function under some constraint (bandwidth of crawlers)
• The penalty for a page takes into account:
  – importance of the page (to be defined next)
  – customer needs (obtained via pub/sub)
  – staleness of the data
    • penalty for being out of date
    • penalty for aging
• The page scheduler fully controls the crawling
  – vs. random crawling in classic search engines
Page Importance
• Based on customer’s criteria and on the link structure of the web
• Intuition: a page is important if many important pages reference it
• Fixpoint definition: importance vector Imp
  – Proposed by IBM; used by search engines such as Google
  – Link matrix: $M(i,j) = 1$ if page i refers to page j, 0 otherwise
  – Outdegree of page i: $out(i)$
  – $Imp_0(k) = 1/N$, where N is the number of pages (initialization)
  – $Imp_m(k) = \sum_i M(i,k) \cdot Imp_{m-1}(i) / out(i)$ (iteration)
  – Imp is the limit
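The fixpoint definition above can be sketched as a plain off-line iteration. This is only an illustration of the formula, not Xyleme's on-line algorithm; the function name `page_importance`, the dictionary representation of the link matrix and the fixed iteration count are assumptions, and the sketch assumes every page has at least one outgoing link and that link lists contain no duplicates.

```python
def page_importance(links, iterations=50):
    """links maps each page i to the list of pages j it refers to (M(i,j) = 1).

    Returns an approximation of Imp after a fixed number of iterations.
    Assumes every page has at least one outgoing link.
    """
    pages = sorted(set(links) | {j for targets in links.values() for j in targets})
    n = len(pages)
    imp = {p: 1.0 / n for p in pages}            # Imp_0(k) = 1/N
    for _ in range(iterations):
        nxt = {p: 0.0 for p in pages}
        for i, targets in links.items():
            if not targets:
                continue                          # no outgoing link: nothing to share
            share = imp[i] / len(targets)         # Imp_{m-1}(i) / out(i)
            for k in targets:
                nxt[k] += share                   # sum over all i with M(i,k) = 1
        imp = nxt
    return imp

# Hypothetical three-page web: a -> b, a -> c, b -> c, c -> a
print(page_importance({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```

On this small example the values settle near 0.4 for a and c and 0.2 for b: c is referenced by both other pages, and a inherits all of c's importance.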
Page Importance
• Novel technology developed by Xyleme
• Patent pending
• On-line evaluation of page importance
• Uses far fewer resources
• Faster reaction to changes on the web
2. XML Repository
Storing XML
• Document systems
  – Good for keyword search
  – No or inefficient support for structure search
• Relational store (e.g., Oracle 8i)
  – Well adapted for some applications
  – Strongly typed data and tables: efficient
  – Otherwise: too many joins and inefficient
• Object database store (e.g., eXcelon) and native XML databases (e.g., Tamino)
  – Same issues
• Xyleme XML native storage
Repository
• Goal
  – minimize I/O for direct access and scanning
  – efficient direct accesses both with full-text indexing and structure indexing
  – good compaction but not at the cost of access
• Efficient storage of trees
  – use fixed length storage pages
  – variable length records inside a page
• Main issue: tree balancing
Tree Balancing
[Figure: a tree split across Record 1, Record 2 and Record 3]
Tree Balancing
Large collections may use several records
3. Semantic Data Integration
Classification
• Based on word occurrences in document and statistical resources
  – Classification by semantic domain
  – Classification by language
• Use the XXX classifier
Semantic Integration
• Web Heterogeneity
• Many possible types for data in a particular domain, many DTDs
• Semantic Integration
  – one abstract DTD for the domain
  – gives the illusion that the system maintains a homogeneous database for this domain
1 domain = 1 abstract DTD
Views
• Choose an abstract DTD for each domain
  – For each concrete DTD in a domain, find how it relates to the abstract DTD using linguistic tools such as WordNet
  – Provide relationships between paths in the concrete and abstract DTD
  – Possibly automatic, manual or hybrid
• With manual mapping, a domain expert may specify much more complex views
• Query processing: process queries on the Abstract DTD
4. Query Processing
Query Language
• Today: A mix of OQL and XQL
• Tomorrow: the future W3C standard
• Example
  select product/name, product/price
  from doc in catalogue,
       product in doc/product
  where product//components contains "flash"
    and product/description contains "camera"
Data Distribution
• Cluster of documents = physical collection of documents ( semantic domain)
Distribution
• Storage machine
  – in charge of a cluster of documents
• Index machine
  – index for a cluster
Step0: Indexing
• Standard inverted index
  – word → documents that contain this word
• Xyleme index
  – word → elements that contain this word (document + element identifier)
• Goal: more work can be performed without accessing data
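The difference between the two indexes can be sketched in a few lines. This is a toy in-memory illustration, not Xyleme's index structure; the function name `build_element_index`, the input layout and the whitespace tokenization are assumptions.

```python
from collections import defaultdict

def build_element_index(documents):
    """Build word -> set of (document id, element id) postings.

    documents maps a document id to a list of (element id, text) pairs.
    A standard inverted index would keep only the document id.
    """
    index = defaultdict(set)
    for doc_id, elements in documents.items():
        for element_id, text in elements:
            for word in text.lower().split():
                index[word].add((doc_id, element_id))
    return index

# Hypothetical documents: element ids are arbitrary integers here.
docs = {
    "d1": [(1, "new camera"), (2, "with flash")],
    "d2": [(1, "camera flash")],
}
index = build_element_index(docs)
# Elements that contain both words, found without opening any document.
print(index["camera"] & index["flash"])
```

With document-level postings both d1 and d2 would match "camera" and "flash", and the documents would have to be read to discover that only d2 has the two words in the same element.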
Step1: Localization
• Query on an abstract DTD
• Localization of machines that host concrete DTDs that will participate in the query
[Figure: a global query on the abstract DTD becomes a union of queries on local machines, then local queries. Example: catalogue/product/price is relevant for machine 56 and machine 45.]
Step2: Optimization
• Algebraic rewriting
• Linear search strategy based on simple heuristics
  – use in memory indexes
  – minimize communication
• Optimization of the global plan
• Optimization of the local plans
Step3: Execution
A plan usually consists of:
1. parallel translation from abstract queries to concrete patterns on the relevant index machines
2. parallel index scans to identify the relevant elements for a concrete pattern
3. parallel construction of resulting elements
4. pipeline evaluation (i.e., no intermediate data structure)
Note: step 2 requires smart indexes
Abstract2Concrete
For each concrete pattern, the local plan is optimized dynamically
[Figure: for catalogue/product/price, scan the relevant concrete patterns (d1//camera/price, d2/product/cost, d3/piano/price, ...); for each concrete pattern, scan the element ids (&234, &177)]
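As a rough sketch of localization and abstract-to-concrete translation, an abstract path can be looked up in a mapping table and the resulting concrete patterns grouped by machine. The table layout, the function name `localize` and the placement of each pattern on machine 56 or 45 are hypothetical; the slides only give the patterns and the machine numbers.

```python
from collections import defaultdict

# Abstract path -> list of (machine, concrete pattern).
# The assignment of patterns to machines is an assumption for illustration.
MAPPINGS = {
    "catalogue/product/price": [
        (56, "d1//camera/price"),
        (56, "d2/product/cost"),
        (45, "d3/piano/price"),
    ],
}

def localize(abstract_path):
    """Return {machine: [concrete patterns]} for one abstract path."""
    plan = defaultdict(list)
    for machine, pattern in MAPPINGS.get(abstract_path, []):
        plan[machine].append(pattern)
    return dict(plan)

print(localize("catalogue/product/price"))
```

Each machine in the result would then scan its own index for the listed patterns in parallel; machines that host no relevant concrete DTD are never contacted.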
Identifiers
• Essential for query processing
• Identifier = (preorder rank/postorder rank)
  – X ancestor of Y <=> pre(X) < pre(Y) and post(X) > post(Y)
  – E.g., 2 < 5 and 4 > 2 => (2,4) ancestor of (5,2)
[Figure: a tree with nodes A to G above a text, each node labeled with its preorder rank (1 to 7) and postorder rank (1 to 7)]
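The identifier scheme above can be sketched as one tree traversal that assigns both ranks, followed by a constant-time ancestor test. The function names and the small example tree are assumptions; the tree is not the one drawn on the slide.

```python
def assign_identifiers(tree, root):
    """Return {node: (preorder rank, postorder rank)} for an ordered tree.

    tree maps each node to the ordered list of its children.
    """
    ids = {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        counters["pre"] += 1
        pre = counters["pre"]            # rank when the node is first reached
        for child in tree.get(node, []):
            visit(child)
        counters["post"] += 1            # rank when the node is left
        ids[node] = (pre, counters["post"])

    visit(root)
    return ids

def is_ancestor(x, y):
    """X ancestor of Y <=> pre(X) < pre(Y) and post(X) > post(Y)."""
    return x[0] < y[0] and x[1] > y[1]

# Hypothetical tree: A has children B and E; B has children C and D.
ids = assign_identifiers({"A": ["B", "E"], "B": ["C", "D"]}, "A")
print(ids)
print(is_ancestor(ids["B"], ids["D"]), is_ancestor(ids["B"], ids["E"]))
```

Because the test only compares two pairs of integers, ancestor relationships can be checked on index entries alone, without loading the documents.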
5. Change Control
Change management
Users are often interested in changes to the web
• Change monitoring
  – query subscription
• Soon to come: Version management
  – representation and storage of changes
Query Subscription
• Users subscribe to certain events such as
  – Update of a particular page, a page in a given site
  – Discovery of a new page containing some specific words
  – Insertion of a particular element in some pages (new products in a catalog)
  – Detection of illegal copies of selected documents
• Users may request to be notified
  – Immediately at the time the event is detected
  – Regularly, e.g., weekly
  – After a certain number of event detections
Examples
subscription myPariscope
  % what are the new movie entries in Pariscope site
  monitoring newMovies
    select URL
    where URL extends www.pariscope.fr/movies/*
      and new(self)
  % manage the changes in the movies showing in Paris
  continuous delta Showing
    select ... from ... where
    when daily
  notify daily % send me a daily report
Atomic Events
• atomic event 46: URL matches pattern www.xyz.com/*
• atomic event 67: XML document contains the tag soccer
[Figure: loading of millions of pages/day. Each loaded document d goes through the HTML parser, XML loader and metadata manager; the document and the atomic events it raised (d/46, then d/46,67) are passed on to complex event detection.]
Complex Events
• complex event 12: 67 & 46 (XML document contains the tag soccer and URL matches pattern www.xyz.com/*)
• Several millions of pages crawled per day
• Hundreds of millions of alerts raised
• Millions of subscriptions
[Figure: the HTML parser and XML loader feed complex event detection]
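The relation between atomic and complex events can be illustrated with a naive detector: collect the atomic events a document raises, then report every complex event whose required atomic events are all present. This linear scan is only a sketch and not the scalable algorithm the slides refer to; the document layout (a dict with `url` and `tags`) and the function name `detect` are assumptions.

```python
from fnmatch import fnmatchcase

# Atomic events from the slides, expressed as tests on a loaded document.
ATOMIC_EVENTS = {
    46: lambda doc: fnmatchcase(doc["url"], "www.xyz.com/*"),
    67: lambda doc: "soccer" in doc["tags"],
}

# Complex event 12 is the conjunction 67 & 46.
COMPLEX_EVENTS = {12: {46, 67}}

def detect(doc):
    """Return (atomic events raised, complex events raised) for one document."""
    atomic = {eid for eid, test in ATOMIC_EVENTS.items() if test(doc)}
    raised = {cid for cid, needed in COMPLEX_EVENTS.items() if needed <= atomic}
    return atomic, raised

# Hypothetical loaded documents.
print(detect({"url": "www.xyz.com/sports/a.xml", "tags": {"soccer", "team"}}))
print(detect({"url": "www.other.org/a.xml", "tags": {"soccer"}}))
```

With millions of subscriptions, scanning every complex event per document is not viable, which is why the detection step needs a dedicated algorithm.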
Notification Processing
• Very efficient/scalable algorithm for complex event detection
• Notifications by
  – Email
  – Web posting
  – Web services in SOAP
[Figure: complex event detection sends alerts to the notification processor, which sends out notifications (millions of notifications/day)]
Xyleme in short
• Spin-off of INRIA (National Research Institute)
  – Technology developed in a research project of 60 man-years
• Creation of Xyleme SA in September 2000
• Now about 25 persons: 13 R&D, 4 services, 10 marketing, sales & admin.
• Customers include: press agency (AFP), newspaper groups (Moniteur, Le Monde), national library (BNF)
• First round of capital in 2000 (SGAM & Viventures)
• Second round in 2002 (Deutsche Bank)
Thank you
(*) If you want to know more about Xyleme: http://www.xyleme.com