Efficient XML Storage, Query, and Update

Shi Xu, Heng Yuan
Spring 2004, CS240B, Prof. Zaniolo


Page 1: Efficient XML Storage, Query, and Update

Efficient XML Storage, Query, and Update

Shi Xu
Heng Yuan
Spring 2004 CS240B
Prof. Zaniolo

Page 2: Efficient XML Storage, Query, and Update

XML Storage Methods

• Flat Streams
• Metamodeling
• Mixed
• Redundant
• Hybrid

Page 3: Efficient XML Storage, Query, and Update

Methods Covered

• "Efficient storage of XML data" covers the hybrid method, using a custom-made storage system called Natix.
• "Efficient relational storage and retrieval of XML documents" covers metamodeling, using the authors' Monet database.

Page 4: Efficient XML Storage, Query, and Update

Natix Overview

• Natix is an efficient, native repository for storing, retrieving and managing XML documents.
• It supports tree-structured objects like XML documents at a low architectural level.

Page 5: Efficient XML Storage, Query, and Update

Natix architectural overview

Page 6: Efficient XML Storage, Query, and Update

Logical Model

• A tree is often used as the logical model of semistructured data.
• Each non-leaf node is labeled with a symbol taken from an alphabet $\Sigma_{DTD}$.
• Leaf nodes can be labeled with the data itself.

Page 7: Efficient XML Storage, Query, and Update

A sample XML with its associated logical tree

Example XML:

<SPEECH>
  <SPEAKER>OTHELLO</SPEAKER>
  <LINE>Let me see your eyes;</LINE>
  <LINE>Look in my face.</LINE>
</SPEECH>
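
As a minimal sketch of how the sample maps onto a logical tree, the following Python snippet parses the speech with the standard library and returns nested (label, children) tuples, with the text of leaf elements stored as string children. The function name `logical_tree` and the tuple layout are illustrative assumptions, not part of Natix; text that follows a child element (mixed content) is ignored for brevity.

```python
import xml.etree.ElementTree as ET

SPEECH = """<SPEECH>
  <SPEAKER>OTHELLO</SPEAKER>
  <LINE>Let me see your eyes;</LINE>
  <LINE>Look in my face.</LINE>
</SPEECH>"""


def logical_tree(elem):
    """Return (label, children); inner nodes carry the tag, leaves carry the data."""
    children = []
    text = (elem.text or "").strip()
    if text:
        # Leaf data: the character content of the element.
        children.append(text)
    for child in elem:
        children.append(logical_tree(child))
    return (elem.tag, children)


tree = logical_tree(ET.fromstring(SPEECH))
print(tree)
```

The root tuple is labeled SPEECH and has three children, one SPEAKER and two LINE nodes, each holding its text as a leaf.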

Page 8: Efficient XML Storage, Query, and Update

Physical Model

Object content:

• "Node" and "object" are used interchangeably.
• A record contains a set of nodes/objects.
• Aggregate nodes are inner nodes of the tree. They contain their respective child nodes.
• Literal nodes are leaf nodes containing an uninterpreted stream of bytes, like text strings, graphics, etc.
• Proxy nodes are nodes which point to different records.
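
The three node kinds can be sketched as plain Python data classes. This is only an illustration under stated assumptions: the class names, the `record_id` field, and the `records` dictionary (record id mapped to the root node of that record's subtree) are hypothetical and do not reflect the actual Natix record layout.

```python
from dataclasses import dataclass, field


@dataclass
class Literal:
    data: bytes                      # uninterpreted byte stream (text, graphics, ...)


@dataclass
class Proxy:
    record_id: int                   # hypothetical id of the record holding the subtree


@dataclass
class Aggregate:
    label: str
    children: list = field(default_factory=list)


def literals(node, records):
    """Yield literal data in document order, following proxies into other records."""
    if isinstance(node, Literal):
        yield node.data
    elif isinstance(node, Proxy):
        yield from literals(records[node.record_id], records)
    else:
        for child in node.children:
            yield from literals(child, records)


# A speech stored in two records: record 1 refers to record 2 through a proxy.
records = {
    1: Aggregate("SPEECH", [Aggregate("SPEAKER", [Literal(b"OTHELLO")]), Proxy(2)]),
    2: Aggregate("LINE", [Literal(b"Let me see your eyes;")]),
}
print(list(literals(records[1], records)))
```

Traversing from record 1 transparently crosses the proxy, so a reader of the logical tree does not need to know where the record boundary lies.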

Page 9: Efficient XML Storage, Query, and Update

Node Representation

• Whole documents (or subtrees of documents) can be stored in one record.
• Each record contains exactly one subtree.
• The root nodes of each record's subtree are called standalone objects; other nodes are called embedded objects.
• The record size has an upper limit, the page size.

Page 10: Efficient XML Storage, Query, and Update

Large Trees

• For a large tree, the physical model must provide a mechanism for distributing data trees over several pages.
• Method 1: "flat" representation. It wastes the available structural information about the data.
• Method 2: split large objects based on the underlying tree structure.
  • Use proxy objects to connect subtrees of the large object residing in other records.

Page 11: Efficient XML Storage, Query, and Update

A Sample Distribution of logical nodes on records

• Proxies (p1, p2)
• Helper aggregate objects (h1, h2)
• Scaffolding objects include proxies and helper aggregates.
• Facade objects (f_i)

Page 12: Efficient XML Storage, Query, and Update

Dynamic maintenance of an efficient storage

• The principal problem is that a record containing a subtree can grow larger than a page if a node is added or grows.
• The subtree contained in the record then has to be partitioned into several subtrees.
• Scaffolding nodes link the new records together in the physical tree.

Page 13: Efficient XML Storage, Query, and Update

Multiway tree representation of records

Page 14: Efficient XML Storage, Query, and Update

Tree Growth Procedure

Step 1: Determine the record r into which the node has to be inserted.
Step 2: If there is not enough space on the page, try to move r. If the record still does not fit, split the record:
  (a) Determine the separator by recursively descending into r's subtree.
  (b) Distribute the resulting partitions onto records.
  (c) Insert the separator into the parent record, recursively calling this procedure.
Step 3: Insert the new node.

Page 15: Efficient XML Storage, Query, and Update

Determining the Insertion Location

• There are several possibilities to insert a new node f_n into the physical tree.
• This choice can be determined by a configuration parameter.

Page 16: Efficient XML Storage, Query, and Update

Determining the separator

• Separator: a tree structure with proxies pointing to the new records, indicating which part of the old record was moved where.
• It consists of all the nodes on the path from a chosen node d to the subtree's root.
• Partition the tree into a left partition L, a right partition R and a separator S.
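
The partitioning into L, S and R can be sketched as follows, given a node d that has already been chosen. Two points are assumptions of this sketch rather than statements from the slides: the separator is taken to be the ancestors of d (d itself excluded), and d together with its subtree is placed in the right partition. The `Node` class and the function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)


def path_to(root, d):
    """Return the nodes from root down to d (inclusive), or None if d is absent."""
    if root is d:
        return [root]
    for child in root.children:
        rest = path_to(child, d)
        if rest is not None:
            return [root] + rest
    return None


def partition(root, d):
    """Return (L, S, R) as lists of names; L and R list the roots of moved subtrees."""
    path = path_to(root, d)
    if path is None:
        raise ValueError("d is not part of this subtree")
    separator = [n.name for n in path[:-1]]      # assumption: ancestors of d only
    left, right_levels = [], []
    for parent, on_path in zip(path, path[1:]):
        i = next(k for k, c in enumerate(parent.children) if c is on_path)
        left += [c.name for c in parent.children[:i]]
        right_levels.append([c.name for c in parent.children[i + 1:]])
    # Assumption: d's own subtree goes right; deeper siblings come first in document order.
    right = [d.name] + [n for level in reversed(right_levels) for n in level]
    return left, separator, right


f5, f6, f7 = Node("f5"), Node("f6"), Node("f7")
f1 = Node("f1", [Node("f2"), Node("f3", [f5, f6, f7]), Node("f4")])
print(partition(f1, f6))
```

In this toy tree, splitting at f6 keeps f1 and f3 as the separator, sends f2 and f5 to the left partition, and sends f6, f7 and f4 to the right partition.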

Page 17: Efficient XML Storage, Query, and Update

A record's subtree before a split occurs

Page 18: Efficient XML Storage, Query, and Update

Splitting a Record

• Distributing the nodes on records
  • After determining the partitioning, the contents of the record have to be distributed onto new records.
  • Each resulting subtree is then stored in its own record, called a partition record.
• Inserting the separator
  • The separator is moved to the parent record.

Page 19: Efficient XML Storage, Query, and Update

Split Algorithm

• Find a node d such that the resulting partitions L and R have the desired size ratio.
• The ratio between the sizes of L and R is determined by a configuration parameter (split target).
• Another configuration parameter, split tolerance, specifies the minimum size for the subtree of d. It is used to prevent fragmentation.
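
One plausible reading of this search is sketched below: descend from the root toward the byte position given by the split target, and stop before entering a subtree smaller than the split tolerance. This is a hedged illustration only. The `Node` class, the byte-size model, and the parameter defaults are assumptions, and the actual Natix procedure may differ in its details.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    size: int = 1                    # own size in bytes (hypothetical units)
    children: list = field(default_factory=list)

    def subtree_size(self):
        return self.size + sum(c.subtree_size() for c in self.children)


def find_split_node(root, split_target=0.5, split_tolerance=2):
    """Descend toward offset split_target * total size; respect the minimum subtree size."""
    target = split_target * root.subtree_size()
    node, offset = root, 0           # offset = bytes that precede `node` in pre-order
    while node.children:
        pos = offset + node.size
        chosen = node.children[-1]
        for child in node.children:
            if pos + child.subtree_size() > target:
                chosen = child       # the target offset falls inside this child
                break
            pos += child.subtree_size()
        else:
            pos -= chosen.subtree_size()   # target beyond the end: use the last child
        if chosen.subtree_size() < split_tolerance:
            break                    # going deeper would fragment the record
        node, offset = chosen, pos
    return node


a = Node("a", children=[Node("a1"), Node("a2")])
b = Node("b", children=[Node("b1"), Node("b2"), Node("b3")])
c = Node("c", children=[Node("c1")])
root = Node("root", children=[a, b, c])
print(find_split_node(root).name)
print(find_split_node(root, split_tolerance=1).name)
```

With a tolerance of 2 the search stops at b, whose subtree holds 4 bytes; lowering the tolerance to 1 lets it continue down to the leaf b1, which illustrates how a small tolerance produces finer and more fragmented splits.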

Page 20: Efficient XML Storage, Query, and Update

Record assembly for the subtree from previous figure

Page 21: Efficient XML Storage, Query, and Update

Physical storage of the tree represented inside one record

Page 22: Efficient XML Storage, Query, and Update

Performance Test

• XML markup version of Shakespeare's plays: 8 MB with 320,000 nodes.
• Pentium-II 333 MHz with 128 MB under Windows NT 4.0 with an IBM DCAS 34330 disk.
• The implementation of the record and tree storage managers was done in C++.

Page 23: Efficient XML Storage, Query, and Update

Test Conditions

• Record:Node 1:1 indicates that smart record splitting is inhibited.
• Record:Node 1:n indicates that the algorithm has full control over the distribution of nodes on records.
• Incremental updates distributed over the whole document.
• Updates in pre-order (append).

Page 24: Efficient XML Storage, Query, and Update

Insertion

Page 25: Efficient XML Storage, Query, and Update

Full tree traversal

Page 26: Efficient XML Storage, Query, and Update

Queries

• Retrieve all speakers in the third act and second scene of every play, which means it accesses all leaf nodes of a certain type in one selected subtree of the document.
• Recreate the textual representation of the complete first speech in every scene, hence reading a lot of small contiguous fragments of each document.
• A simple path query was evaluated by reading only the opening speech of each play.

Page 27: Efficient XML Storage, Query, and Update

Selection on leaf nodes of document subtree

Page 28: Efficient XML Storage, Query, and Update

Small contiguous fragments

Page 29: Efficient XML Storage, Query, and Update

Single path for each document

Page 30: Efficient XML Storage, Query, and Update

Space requirements

Page 31: Efficient XML Storage, Query, and Update

Monet Model

• An XML document is decomposed into binary relations.
• Efficient for storage and retrieval of XML documents in a relational database.
• The database used is the authors' Monet database server, which supports the Monet model.

Page 32: Efficient XML Storage, Query, and Update

Some Definitions

• An XML document is a rooted tree $d = (V, E, r, \mathrm{label}_E, \mathrm{label}_A, \mathrm{rank})$ with nodes $V$, edges $E \subseteq V \times V$ and a distinguished node $r \in V$.
• The function $\mathrm{label}_E : V \to \mathrm{string}$ assigns labels to nodes.
• $\mathrm{label}_A : V \to \mathrm{string} \to \mathrm{string}$ assigns pairs of strings, attributes and their values, to nodes.
• $\mathrm{rank} : V \to \mathrm{int}$ establishes a ranking to allow for an order among nodes with the same parent node.

Page 33: Efficient XML Storage, Query, and Update

A sample XML document

<bibliography>
  <article key="BB88">
    <author>Ben Bit</author>
    <title>How to Hack</title>
  </article>
  <article key="BK99">
    <editor>Ed Itor</editor>
    <author>Bob Byte</author>
    <author>Ken Key</author>
    <title>Hacking &amp; RSI</title>
  </article>
</bibliography>

Page 34: Efficient XML Storage, Query, and Update

Syntax Tree of the Previous XML Document

Page 35: Efficient XML Storage, Query, and Update

Monet Transform

Given an XML document d, the Monet transform is a quadruple $M_t(d) = (r, R, A, T)$ where

• R is the set of binary relations that contain all associations between nodes;
• A is the set of binary relations that contain all associations between nodes and their attribute values, including character data;
• T is the set of binary relations that contain all pairs of nodes and their rank;
• r is the root of the document.
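
A rough sketch of such a decomposition, applied to the sample bibliography document, is shown below. It groups the binary relations by the path from the root, which is the idea behind the transform; the relation naming scheme (`path[attribute]`, and `[cdata]` for character data), the oid format, and the function name `monet_transform` are assumptions made for illustration and not the notation of the Monet system.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

DOC = """<bibliography>
  <article key="BB88">
    <author>Ben Bit</author>
    <title>How to Hack</title>
  </article>
  <article key="BK99">
    <editor>Ed Itor</editor>
    <author>Bob Byte</author>
    <author>Ken Key</author>
    <title>Hacking &amp; RSI</title>
  </article>
</bibliography>"""


def monet_transform(xml_text):
    """Return (r, R, A, T): root oid plus path-named binary relations."""
    R, A, T = defaultdict(list), defaultdict(list), defaultdict(list)
    counter = 0

    def new_oid():
        nonlocal counter
        counter += 1
        return f"o{counter}"

    def visit(elem, oid, path):
        for name, value in elem.attrib.items():
            A[f"{path}[{name}]"].append((oid, value))       # node -> attribute value
        text = (elem.text or "").strip()
        if text:
            A[f"{path}[cdata]"].append((oid, text))         # character data as a value
        for rank, child in enumerate(elem, start=1):
            child_oid = new_oid()
            child_path = f"{path}/{child.tag}"
            R[child_path].append((oid, child_oid))          # parent -> child association
            T[child_path].append((child_oid, rank))         # node -> rank among siblings
            visit(child, child_oid, child_path)

    root = ET.fromstring(xml_text)
    root_oid = new_oid()
    visit(root, root_oid, root.tag)
    return root_oid, dict(R), dict(A), dict(T)


r, R, A, T = monet_transform(DOC)
for path, pairs in R.items():
    print(path, pairs)
```

Under this layout, a path query such as "all authors of all articles" reduces to reading the single relation `A["bibliography/article/author[cdata]"]`, without touching editors or titles.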

Page 36: Efficient XML Storage, Query, and Update

Monet Transform of the Example Document

Page 37: Efficient XML Storage, Query, and Update

OQL-like query

Page 38: Efficient XML Storage, Query, and Update

Query Handling

Page 39: Efficient XML Storage, Query, and Update

Assessment

• Implemented within the Monet database server.
• Tested on a 550 MHz Silicon Graphics 1400 Server with 1 GB main memory.
• Also used a Sun UltraSparc-IIi with 360 MHz and 256 MB main memory to contrast with a related work.

Page 40: Efficient XML Storage, Query, and Update

Size of document collections in XML and Monet XML format

Page 41: Efficient XML Storage, Query, and Update

Scaling of Document

• Scaled the ACM Anthology from 30 to 3 × 10^6, which corresponds to an XML source size between 10 KB and 1 GB.
• Ran 4 queries consisting of path expressions of length 1 through 4 for various sizes of the anthology.

Page 42: Efficient XML Storage, Query, and Update

Response Time vs. Result Size

Page 43: Efficient XML Storage, Query, and Update

Comparison of response time for query set of SYU, another method for storage/retrieval of XML documents.

Page 44: Efficient XML Storage, Query, and Update

Compare/Contrast Natix and Monet

• Natix uses a custom database while Monet is built on top of a relational database.
• Neither uses a DTD.
• Natix focuses on XML query as well as update.
• Monet focuses on XML storage and query.
• Though an equivalent test is lacking, Monet appears to be faster than Natix on queries.
• Monet seems to be more space efficient than Natix as well.

Page 45: Efficient XML Storage, Query, and Update

References

• "Efficient storage of XML data" by Carl-Christian Kanne, et al. ICDE 2000. http://citeseer.nj.nec.com/kanne99efficient.html
• "Efficient Relational Storage and Retrieval of XML Documents" by Albrecht Schmidt, et al. WebDB 2000. http://www.research.att.com/conf/webdb2000/program.html