[IEEE 22nd International Conference on Data Engineering (ICDE'06) - Atlanta, GA, USA (2006.04.3-2006.04.7)] 22nd International Conference on Data Engineering (ICDE'06) - An Estimation System for XPath Expressions


Post on 02-Mar-2017







<ul><li><p>An Estimation System for XPath Expressions</p><p>Hanyu Li, Mong Li Lee, Wynne Hsu, School of Computing, National University of Singapore</p><p>{lihanyu,leeml,whsu}@comp.nus.edu.sg</p><p>Gao Cong, University of Edinburgh</p><p>gao.cong@ed.ac.uk</p><p>Abstract</p><p>Estimating the result sizes of XML queries is important in query optimization and is useful in providing quick feedback about the queries. Existing works have focused on the selectivity estimation of XML queries without order-based axes. In this work, we develop a framework to estimate the result sizes of XPath expressions with order-based axes. We describe how the path and order information of XML elements can be captured and summarized in compact data structures. We also describe methods to estimate the selectivity of XPath queries. The results of extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness and accuracy of the proposed approach.</p><p>1 Introduction</p><p>The increasing number of XML repositories has led to the design and development of systems to efficiently store XML data and process XML queries. XML can be modeled as an ordered tree pattern that specifies the sequence order of sibling nodes. For example, if a book is organized using XML, then the chapter order of the book is important and a query can ask for the second chapter of the book. Other examples include data with an ordered time domain (temporal XML) and DNA sequences stored using XML. Hence, it is important to provide support for XML queries involving order axes in applications with intrinsically ordered data.</p><p>XPath [4] has emerged as the standard for navigating XML documents. 
XPath utilizes path expressions to locate nodes in an XML tree, and has the following syntax:</p><p>PathExpr ::= /Step<sub>1</sub>/Step<sub>2</sub>/.../Step<sub>n</sub><br>Step ::= Axis :: NodeTest Predicate</p><p>Given an XPath query, the Axis in a Step establishes the set of XML nodes that are reachable via this axis, NodeTest examines the node name, and a set of Predicates can be imposed on the nodes. For example, the XPath query /descendant::Play/child::Act retrieves all the acts of a play. For simplicity, this query can alternatively be expressed as //Play/Act, where // and / denote the descendant and child axes respectively.</p><p>The preceding and following axes in XPath describe the sequence of nodes before/after the context node (excluding any ancestors/descendants). For example, the query //Storm/following::Tornado requires that the element Tornado must occur after the element Storm. XPath also provides the preceding-sibling and following-sibling axes to select all preceding or following sibling nodes.</p><p>Existing labeling schemes such as interval based [9, 17] and prime number [15] preserve the order information of XML data. [14] examines how ordered XML can be stored and queried using a relational database system. However, a key issue that has been neglected is how the selectivity of XML queries with order-based axes can be estimated.</p><p>The selectivity estimation of XML queries with order axes is a challenging task given the huge volume of the order information that needs to be summarized. Existing XML selectivity estimators [5, 6, 10, 11, 12, 13, 16] are designed specifically for XML queries without order axes, and the order information is typically not captured.</p><p>Contributions. This paper describes a framework for estimating the selectivity of XPath expressions with order-based axes. To the best of our knowledge, this is the first work to address the problem of summarizing order information in XML data. 
The key contributions are:</p><p>1) We use a path encoding scheme to aggregate the path and order information of XML data. This scheme associates each node in an XML tree with a path id that indicates the type of path where the node occurs. Based on the path ids, the frequencies of element tags and sibling node sequences can be collected.</p><p>2) We design two compact structures, p-histogram and o-histogram, to summarize the path and order information of XML data respectively. In order to reduce the effect of data skewness in the buckets, we use the intra-bucket frequency variance to control the histogram construction.</p><p>3) We develop effective methods to estimate the selectivity of XPath expressions. We first remove the irrelevant path ids associated with elements involved in a query. Then the frequency values of the remaining path ids are utilized to calculate the selectivity.</p><p>Proceedings of the 22nd International Conference on Data Engineering (ICDE06) 8-7695-2570-9/06 $20.00 2006 IEEE </p></li><li><p>4) We carry out an extensive experimental study of the proposed approach. The results show that the proposed solution yields very low estimation error rates for XPath queries even with limited memory space.</p><p>The rest of this paper is organized as follows. Section 2 gives the background. Section 3 describes the path and order information captured. The estimation methods are presented in Sections 4 and 5. We describe the data structures in Section 6. Section 7 gives the experimental results. Finally, we discuss related work in Section 8 and conclude in Section 9.</p><p>2 Preliminaries</p><p>[8] designs a path encoding scheme to label XML nodes for efficient structural join. In this section, we review this labeling scheme and the notion of path id containment, which is the basis of the proposed estimation system.</p><p>Path Encoding Scheme. 
The path encoding scheme [8] uses an integer to encode each distinct root-to-leaf path in an XML document and stores these encodings in an encoding table. Figure 1(a) shows an XML document with four distinct root-to-leaf paths. Each distinct path is assigned an integer and inserted into the encoding table shown in Figure 1(b).</p><p>The labeling scheme associates each element node in an XML document with a path id that comprises a sequence of bits. The number of bits is the number of distinct root-to-leaf paths, i.e., the number of tuples in the encoding table. Path ids are assigned to nodes as follows:</p><p>1) For a leaf node, the path id is given by setting the i-th bit (from the left) to 1, where i denotes the encoding of the root-to-leaf path on which the leaf node occurs.</p><p>2) For a non-leaf node, the path id is given by a bit-or operation on the path ids of all its child nodes.</p><p>Example 2.1: In Figure 1(a), the path id of the first leaf node D is p5 (1000) since the encoding of the path Root/A/B/D on which D occurs is 1. The path id of the first C node, p3 (0011), is obtained by a bit-or operation on the path ids of its child nodes E and F, whose path ids are p2 (0010) and p1 (0001) respectively. All bit sequences are collected in a path id table (Figure 1(c)).</p><p>Path Id Containment. The path ids can be utilized to detect parent-child or ancestor-descendant relationships between two sets of nodes. Let S<sub>X</sub> be a set of element nodes labeled with X that have the same path id Pid<sub>X</sub>. 
Let S<sub>Y</sub> be another set of element nodes with a similar definition. We will discuss how to determine the relationship between these two sets of nodes.</p><p>[Figure 1(a), XML Instance: tree with nodes Root(p9); A(p6), A(p7), A(p8); B(p5) &times;3, B(p8); C(p2), C(p3); D(p5) &times;4; E(p2) &times;2, E(p4); F(p1).]</p><p>[Figure 1(b), Encoding Table (root-to-leaf path: encoding): Root/A/B/D: 1; Root/A/B/E: 2; Root/A/C/E: 3; Root/A/C/F: 4.]</p><p>[Figure 1(c), Path Id Table (path id: bit sequence): p1: 0001; p2: 0010; p3: 0011; p4: 0100; p5: 1000; p6: 1010; p7: 1011; p8: 1100; p9: 1111.]</p><p>Figure 1. Path Encoding Scheme</p><p>Case 1: Pid<sub>X</sub> = Pid<sub>Y</sub>. There exists at least one ancestor-descendant relationship between a node x &isin; S<sub>X</sub> and a node y &isin; S<sub>Y</sub>. A path id Pid can be decomposed into a set of root-to-leaf paths, each of which corresponds to one bit with value 1 in Pid. Thus, the relationship of x and y can be determined by the relationship of their tags, X and Y, in any one of the root-to-leaf paths of Pid<sub>X</sub> (or Pid<sub>Y</sub>). Given a root-to-leaf path and two element tags, we can check the relationship between the tags from the encoding table.</p><p>Example 2.2: Consider the element tags A and B with the same path id p8 (1100) in Figure 1. We obtain the root-to-leaf paths with encoding values 1 and 2 from the path id p8 (1100) since the bits in positions 1 and 2 are 1. After checking the root-to-leaf path with encoding value 1 in the encoding table, we know that tag A is the ancestor of tag B. Hence, each node A with path id p8 is an ancestor of B with path id p8. Further, element A is the parent of B.</p><p>Case 2: Pid<sub>X</sub> &ne; Pid<sub>Y</sub>. We say Pid<sub>X</sub> contains Pid<sub>Y</sub> if Pid<sub>X</sub> &ne; Pid<sub>Y</sub> and (Pid<sub>X</sub> &amp; Pid<sub>Y</sub>) = Pid<sub>Y</sub>, where &amp; denotes the bit-and operation. If Pid<sub>X</sub> contains Pid<sub>Y</sub>, each node x in S<sub>X</sub> must have at least one descendant node y &isin; S<sub>Y</sub>. 
This is because each node x in S<sub>X</sub> must occur on the path(s) where at least one y element occurs. Further, Pid<sub>X</sub> has at least one bit with value 1 such that the corresponding bit of Pid<sub>Y</sub> is 0, due to the path id containment. In other words, there exists at least one path on which node x occurs while node y does not. Therefore, each node in S<sub>X</sub> must have at least one element y in S<sub>Y</sub> as a descendant.</p></li><li><p>Example 2.3: In Figure 1, the path id p3 (0011) for node C contains the path id p2 (0010) for node E. Each node C with p3 (0011) must be the ancestor of at least one node E with path id p2 (0010). We also know that C is the parent of E by looking up the common path of p2 and p3.</p><p>3 Capturing Path and Order Information</p><p>The proposed estimation system captures the path and order information of XML data in the PathId-Frequency table and Path-Order table respectively.</p><p>PathId-Frequency Table. Each tuple in the pathId-frequency table represents a distinct element tag in an XML document, and we aggregate all the path ids and their corresponding frequencies of each element tag.</p><p>Example 3.1: Figure 2(a) shows the pathId-frequency table for the XML data in Figure 1. Since there are two C elements occurring in the XML document, one of which is associated with path id p3 and the other one with p2, the entry for C in the table contains {(p2, 1), (p3, 1)}.</p><p>Path-Order Table. The path-order table captures the sibling order information based on the path ids. Each distinct element tag is associated with a path-order table.</p><p>Given an element tag X, each column in its path-order table denotes one path id on which the elements X occur, and each row represents one element tag in the XML document. There are two regions in the path-order table, namely, the +element and element+ regions. 
In the +element region, a grid cell, denoted by g(pathid, tag), represents the number of elements X with pathid occurring before elements tag. In contrast, the grid cell g(pathid, tag) in the element+ region denotes the number of elements X with pathid occurring after elements tag.</p><p>[Figure 2(a), PathId-Frequency Table (element: (path id, frequency)): Root: (p9,1); A: (p6,1) (p7,1) (p8,1); B: (p8,1) (p5,3); C: (p2,1) (p3,1); D: (p5,4); E: (p4,1) (p2,2); F: (p1,1).]</p><p>[Figure 2(b), Path-Order Table for Element B: columns are the path ids p5 and p8; rows are the element tags (Root, A, B, C, ...) in the +Ele and Ele+ regions; the non-empty cells are g(p5, C) = 1 in the +Ele region and g(p5, C) = 2 in the Ele+ region.]</p><p>Figure 2. Path and Order Information</p><p>Example 3.2: Figure 2(b) shows the path-order table for element B in Figure 1. Since one B element annotated with p5 occurs before C and two B elements with p5 occur after element C, the values in the corresponding cells are 1 and 2 respectively. All the other cells are empty.</p><p>Note that if an element with tag X occurs both after and before elements with tag Y, then both rows for tag Y in the path-order table of X will count this X element.</p><p>The pathId-frequency table captures the path ids of each element tag and their frequencies. This information is utilized to estimate the selectivity of XPath expressions without order axes. In contrast, the path-order table aggregates the frequencies of sibling nodes. This information is used to estimate XPath expressions with order axes.</p><p>4 Estimating Queries with No Order Axes</p><p>In this section, we describe the estimation of simple and branch queries without order axes, which is based on a path id join algorithm [8].</p><p>Path Join. Given an XPath query Q, the path join retrieves a set of path ids and the corresponding frequencies for each element tag in Q from the pathId-frequency table. 
For each pair of adjacent element tags in Q, we use a nested loop to determine the containment of the path ids in their sets. Path ids that clearly do not contribute to the query result will be removed. The frequency values of the remaining path ids will be utilized to estimate the query size.</p><p>[Figure 3(a), Q<sub>1</sub> before path id join: A {(p6,1) (p7,1) (p8,1)}; C {(p2,1) (p3,1)}; F {(p1,1)}; B {(p8,1) (p5,3)}; D {(p5,4)}.]</p><p>[Figure 3(b), Q<sub>1</sub> after path id join: A {(p7,1)}; C {(p3,1)}; F {(p1,1)}; B {(p5,3)}; D {(p5,4)}.]</p><p>Figure 3. Path Ids Test</p><p>Example 4.1: Let us issue the query Q<sub>1</sub> = //A[/C/F]/B/D in Figure 3(a) on the XML data in Figure 1. The path id join algorithm evaluates query Q<sub>1</sub> by removing irrelevant path ids from the nodes in the query. The final result is shown in Figure 3(b). We observe that path id p2 for C is removed from the path id list because p2 cannot contain the path id p1 for F. Further, path ids p6 and p8 for A are removed since they cannot contain the path id p3 for the child node C, etc.</p><p>Given an XPath query Q with target node n (the node whose selectivity is to be estimated), we denote the selectivity for n as S<sub>Q</sub>(n) and the sum of frequencies of the remaining path ids after the path join for n as f<sub>Q</sub>(n).</p></li><li><p>Estimating Simple Queries. A simple XPath query Q has the basic form /n<sub>1</sub>/n<sub>2</sub>/.../n<sub>i</sub>, where / and n<sub>i</sub> denote the child axis and the label of a node respectively. 
The following theorem computes the selectivity of simple XPath queries.</p><p>Theorem 4.1 After carrying out a path id join on a simple query Q, the summarized frequency value f<sub>Q</sub>(n) of node n is the same as the selectivity S<sub>Q</sub>(n), that is,</p><p>S<sub>Q</sub>(n) = f<sub>Q</sub>(n) &nbsp;&nbsp; (1)</p><p>Proof: After a path id join, only the path ids that satisfy the path id containment relationship are associated with the nodes involved in the query. Thus, f<sub>Q</sub>(n) = S<sub>Q</sub>(n).</p><p>Example 4.2: Consider the query //A//C issued on the XML instance in Figure 1. After a path id join, the lists of path ids that are associated with A and C are {p6, p7} and {p2, p3} respectively. From the corresponding frequencies, we know that the selectivity for both A and C is 2.</p><p>Estimating Branch Queries. We define a branch query pattern Q as /n<sub>1</sub>/.../n<sub>i</sub>[/n<sub>i1</sub>/.../n<sub>il</sub>]/n<sub>i+1</sub>.../n<sub>m</sub>. The path /n<sub>1</sub>/.../n<sub>i</sub> is the trunk part, while the paths /n<sub>i1</sub>/.../n<sub>il</sub> and /n<sub>i+1</sub>.../n<sub>m</sub> are the branch parts of the query. XPath provides different formats, such as q<sub>1</sub>[/q<sub>2</sub>]/q<sub>3</sub> or q<sub>1</sub>[/q<sub>2</sub>][/q<sub>3</sub>], to specify the position of the target node whose selectivity is to be estimated. Here, we standardize the branch query pattern as q<sub>1</sub>[/q<sub>2</sub>]/q<sub>3</sub>, where each q<sub>i</sub> is a simple query, and explicitly specify the target node.</p><p>Given a branch query Q, if the target node n occurs in the trunk part q<sub>1</sub> of the query, then according to Theorem 4.1, we have S<sub>Q</sub>(n) = f<sub>Q</sub>(n). However, if n occurs on the branch part q<sub>2</sub> or q<sub>3</sub> of Q, then f<sub>Q</sub>(n) may over-estimate S<sub>Q</sub>(n). This is because path ids are designed to directly capture the parent-child and ancestor-descendant containment rel...</p></li></ul>
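<p>The path id assignment of Section 2, the pathId-frequency table of Section 3, and the containment test behind Example 4.2 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the tree literal is a hypothetical instance chosen to be consistent with the node labels of Figure 1 (its sibling order is assumed), the function names are invented for this example, and the join handles only Case 2 containment for a two-step descendant query, leaving out the Case 1 check for equal path ids.</p>

```python
from collections import Counter, defaultdict

# Encoding table as in Figure 1(b): root-to-leaf path -> integer encoding.
ENCODING = {
    "Root/A/B/D": 1,
    "Root/A/B/E": 2,
    "Root/A/C/E": 3,
    "Root/A/C/F": 4,
}
NUM_PATHS = len(ENCODING)

# Hypothetical tree as (tag, children); the sibling order is an assumption.
TREE = ("Root", [
    ("A", [("C", [("E", [])]), ("B", [("D", [])])]),
    ("A", [("B", [("D", [])]),
           ("C", [("E", []), ("F", [])]),
           ("B", [("D", [])])]),
    ("A", [("B", [("D", []), ("E", [])])]),
])


def assign_path_ids(node, prefix="", out=None):
    """Return (path id of node, list of (tag, path id) for the whole subtree)."""
    if out is None:
        out = []
    tag, children = node
    path = f"{prefix}/{tag}" if prefix else tag
    if not children:
        # Leaf: set the i-th bit from the left, i = encoding of its path.
        pid = 1 << (NUM_PATHS - ENCODING[path])
    else:
        # Non-leaf: bit-or of the path ids of all child nodes.
        pid = 0
        for child in children:
            child_pid, _ = assign_path_ids(child, path, out)
            pid |= child_pid
    out.append((tag, pid))
    return pid, out


def bits(pid):
    """Render a path id as a fixed-width bit string, e.g. 8 -> '1000'."""
    return format(pid, f"0{NUM_PATHS}b")


def contains(pid_x, pid_y):
    """Case 2 containment: pid_x != pid_y and (pid_x & pid_y) == pid_y."""
    return pid_x != pid_y and (pid_x & pid_y) == pid_y


def pathid_frequency(labelled):
    """Aggregate, per element tag, the frequency of each path id."""
    table = defaultdict(Counter)
    for tag, pid in labelled:
        table[tag][bits(pid)] += 1
    return {tag: dict(counts) for tag, counts in table.items()}


def join_descendant(freq, anc_tag, desc_tag):
    """Simplified join for //anc//desc using Case 2 containment only."""
    anc = {p: f for p, f in freq[anc_tag].items()
           if any(contains(int(p, 2), int(q, 2)) for q in freq[desc_tag])}
    desc = {q: f for q, f in freq[desc_tag].items()
            if any(contains(int(p, 2), int(q, 2)) for p in anc)}
    return anc, desc


root_pid, labelled = assign_path_ids(TREE)
FREQ = pathid_frequency(labelled)
```

<p>With the assumed tree, <code>join_descendant(FREQ, "A", "C")</code> keeps the path ids 1010 and 1011 (p6, p7) for A and 0010 and 0011 (p2, p3) for C. The remaining frequencies sum to 2 for each tag, which agrees with Example 4.2.</p>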

