Transcript
Page 1: Projecting XML Documents

9/11/2003 1

Projecting XML Documents

Amélie Marian, Columbia UniversityJérôme Simeon, Bell Laboratories

Page 2: Projecting XML Documents

9/11/2003 2

Motivation

XQuery used in very different environments: XQuery implementations on XML stored in

databases (with indexes). Main-memory XQuery implementations on XML

in files, sent as streams, computed on the fly…

Example Applications: Web Services (e.g., ActiveXML). Telecommunication apps (XML messages). XML documents. Information Integration.

Page 3: Projecting XML Documents

9/11/2003 3

Memory Limitations

Main-memory XQuery implementations cannot handle large documents.Complex XQuery expressions require materialization (DOM).DOM is the bottleneck.

XQuery Processors

Maximum Document

Size

QuipKweeltGalax

Xalan (XSLT)

7Mb17Mb33Mb75Mb

XMark Query 1 on an IBM laptop T23 (256Mb RAM)

Page 4: Projecting XML Documents

9/11/2003 4

Projection: Example<site>

<regions>...</regions> <people>

... <person id="person120"> <name>Wagar Bougaut</name> <emailaddress>mailto:[email protected]</emailaddress> </person> <person id="person121">

<name>Waheed Rando</name><emailaddress>mailto:[email protected]</emailaddress><address> <street>32 Mallela St</street>

<city>Tucson</city> <country>United States</country> <zipcode>37</zipcode></address>

<creditcard>7486 5185 1962 7735</creditcard><profile income="59224.09">...

<site> <regions>...</regions>

<people>... <person id="person120"> <name>Wagar Bougaut</name> <emailaddress>mailto:[email protected]</emailaddress> </person> <person id="person121">

<name>Waheed Rando</name><emailaddress>mailto:[email protected]</emailaddress><address> <street>32 Mallela St</street>

<city>Tucson</city> <country>United States</country> <zipcode>37</zipcode></address>

<creditcard>7486 5185 1962 7735</creditcard><profile income="59224.09">...

XMark Query 1for $b in /site/people/person[@id=“person0”]return $b/name

Less than 2% of original document !

Page 5: Projecting XML Documents

9/11/2003 5

Projection: IntuitionGiven a query:For $b in /site/people/person[@id=“person0”]Return $b/name Most nodes in the input document(s) are not required. Projection operation removes unnecessary nodes. Evaluation of the query on projected document yields

the same results as on the original document.How it works:

Projection defined by set of paths. Static analysis infers sets of paths used within a query.

/site/people/person/site/people/person/@id/site/people/person/name

Page 6: Projecting XML Documents

9/11/2003 6

Projection: Challenges

For an XQuery expression, compute all paths that allow to reach nodes required to evaluate the expression.XQuery is complex: Variables Composition Syntactic Sugar Complex expressions

Have to be able to analyze all of XQuery.

Page 7: Projecting XML Documents

9/11/2003 7

Contributions

Definition of a notion of projection for XML documents, based on path expressions.Static Analysis algorithm for arbitrary XQuery expressions used to infer projection paths.Loading algorithm to build projected XML document.Integration of XML Projection in XQuery processor (Galax).Experimental evaluation of projection on XML query processing.

Page 8: Projecting XML Documents

9/11/2003 8

XML Projection

Similar to relational projection: One key operation. Prunes unnecessary part of the data. Essential for memory management.

Specific problems related to XML: Projection must operate on trees. Requires analysis of the query. Need to address XQuery complexity.

Page 9: Projecting XML Documents

9/11/2003 9

Notation

Projection Paths: Path expressions are noted using XPath

semantics (/site/people/person/@id) “#” notation used when subtree should

be kept (/site/people/person/name#)

Static Analysis: inference rule notationExpr => Paths

Page 10: Projecting XML Documents

9/11/2003 10

Static Analysis: Variables

Variables can be bound to nodes coming form different paths.for $b in /site/people/(teacher | student)return $b/name

Analysis must remember paths to which variable was bound/site/people/teacher

/site/people/student Environment is maintained during path analysis:

Env |- Expr => Paths

Page 11: Projecting XML Documents

9/11/2003 11

Static Analysis: Example

Literals do not require any paths:

Paths are propagated in a sequence:

Static analysis algorithm is correct (see Technical Report)

Env |- Literal => {}

Env |- Expr1 => Paths1Env |- Expr2 => Paths2

Env |- Expr1,Expr2 => Paths1 U Paths2

32 => {}

/a/b => {/a/b}

/a/d => {/a/d}

/a/b,/a/d => {/a/b,/a/d}

Page 12: Projecting XML Documents

9/11/2003 12

Static Analysis: Composition

(if (count (/site/regions/*) = 3)then /site/people/personelse /site/open_auctions/open_auction)/@id

/@id does not apply to /site/regions/*Final set of paths should be/site/regions/*/site/people/person/@id/site/open_auctions/open_auction/@id

Need to differentiate two sets of paths during analysis:

Returned Paths: returned by the expression, further path steps are applied on them.

Used Paths: used to compute the expression.

Env |- Expr => Paths using UPaths

Page 13: Projecting XML Documents

9/11/2003 13

XQuery CoreSubset of XQuery:

Reduced grammar. All operations are made explicit. Same expressive power as XQuery. Removes syntactic sugar. Simplifies complex expressions

Analysis only needs to be applied to a small set of expressions

for $b in /site/people/personreturn $b/name

for . in /return for . in child::site return for . in child::people return for . in child::person return child::name

XQuery

XQuery Core

Page 14: Projecting XML Documents

9/11/2003 14

Optimized Projection

XQuery Core decomposition may lead to redundant paths being kept:/site/site/people/site/people/person/site/people/person/name#

Optimization on inference rules can avoid redundancyDetails in paper, full optimized analysis in technical report

Page 15: Projecting XML Documents

9/11/2003 15

XQuery Processing Architecture

XQueryParser

QueryEvaluation

SAX ParserXML Data

Model Loader

XQueryExpression

Input XMLDocument

XQuery AbstractSyntax tree

XML QueryResult

SAXEvents

PathAnalysis

Projection Paths

Projected DataModel

Document Data Model

Page 16: Projecting XML Documents

9/11/2003 16

Loading Algorithm: Description

Input: Set of projection paths. Document SAX events.

Decide on action to apply on document nodes: Skip: ignore node and its subtree. KeepSubtree: keep node and its subtree. Keep: keep node without its subtree. Move: keep processing SAX events. Current

node is only kept if some of its children are kept.

Keep a set of current paths.

Page 17: Projecting XML Documents

9/11/2003 17

Loading Algorithm: Example

Projection Paths:/a/b/c#/a/d

Document Stream

<a> <g> </b><b> </g> </f><f> </c><c><b> </b><d> </d><e></e> </a>

Current Paths:

Loaded Nodes:

/b/c#/d

Action: MoveSkip

/c#

Keep Subtree

c

f

Keep

b d

/a/b/c#/a/d

a

Similar to XML filtering algorithms

Limitations: - Backward Axis!- Number of current paths can be huge (descendant axis)

Page 18: Projecting XML Documents

9/11/2003 18

Experiments: Settings

XML Projection Evaluation: Effectiveness: projection impact on

different queries. Maximum document size: largest

document that can be processed. Processing time: effect on processing

time.

Experimental Setup: Default XMark document size: 50Mb.

Configuration CPU Cache Size RAM

A 1GHz 256Kb 256Mb

B 550MHz 512Kb 768Mb

C (default) 1.4GHz 256Kb 2Gb

Page 19: Projecting XML Documents

9/11/2003 19

Experiments: Effectiveness

0

1

2

3

4

5

6

7

8

9

10

Siz

e a

s p

erc

en

tag

e o

f th

e s

ize

of

the

o

rig

ina

l d

oc

um

en

t

Projection

OptimizedProjection

100% 100% 33%100% 60%

All queries but one require less than 5% of the document.

Page 20: Projecting XML Documents

9/11/2003 20

Experiments: Maximal Document Size

Configuration A B CXMark Query 3

(simple selection with

predicate)

No Projection 33Mb 220Mb 520MbOptimized Projection

1Gb 1.5Gb 1.5Gb

XMark Query 14 (Non-

selective path query with predicates)

No Projection 20Mb 20Mb 20MbOptimized Projection

100Mb 100Mb 100Mb

XMark Query 15

(Long, very selective path

query)

No Projection 33Mb 220Mb 520MbOptimized Projection

1Gb 2Gb 2Gb

Page 21: Projecting XML Documents

9/11/2003 21

Experiments: Query Execution Time

0

50

100

150

200

250

To

tal Q

uer

y E

xecu

tio

n T

ime

(in

sec

on

ds)

Projection significantly reduces query processing timeNext Bottleneck: Joins!

0

1000

2000

3000

4000

5000

6000

7000

8000

Query8

Query9

Query10

Query11

Query12

Query14

No Projection

Projection

Optimized Projection

Page 22: Projecting XML Documents

9/11/2003 22

Improvements

Complete XQuery implementation with projection available in Galax Demo at VLDB 2003.

Galax uses a more recent pure streaming algorithm for applying projection to a document. Better performance. Can be used as a stand-alone

operation, without loading.

Page 23: Projecting XML Documents

9/11/2003 23

Conclusion and Future Work

Main contributions: Definition of a notion of projection for XML. Static analysis to infer projection paths from any

XQuery expression. Full implementation in Galax.

Experimental results: Dramatic increase in the size main-memory XQuery

processor can handle. Projection helps reducing query processing time.

Future work: Define loading algorithm for backward axis. Combine projection with other optimizations. Pushing down query operations during projection

(e.g., predicate evaluation)

Page 24: Projecting XML Documents

9/11/2003 24

Advertisment“XQuery from the Experts” Just released by Addison Wesley

Ask Jerome for 20% discount flyers!


Top Related