IEEE 22nd International Conference on Data Engineering (ICDE'06), Atlanta, GA, USA (April 3-7, 2006)

An Estimation System for XPath Expressions

Hanyu Li, Mong Li Lee, Wynne Hsu
School of Computing, National University of Singapore

{lihanyu,leeml,whsu}@comp.nus.edu.sg

Gao Cong
University of Edinburgh

[email protected]

Abstract

Estimating the result sizes of XML queries is important in query optimization and is useful in providing quick feedback about the queries. Existing works have focused on the selectivity estimation of XML queries without order-based axes. In this work, we develop a framework to estimate the result sizes of XPath expressions with order-based axes. We describe how the path and order information of XML elements can be captured and summarized in compact data structures. We also describe methods to estimate the selectivity of XPath queries. The results of extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness and accuracy of the proposed approach.

1 Introduction

The increasing number of XML repositories has led to the design and development of systems to efficiently store XML data and process XML queries. An XML document can be modeled as an ordered tree in which the sequence order of sibling nodes is significant. For example, if a book is organized using XML, then the chapter order of the book is important and a query can ask for the second chapter of the book. Other examples include data with an ordered time domain (temporal XML) and DNA sequences stored using XML. Hence, it is important to provide support for XML queries involving order axes in applications with intrinsically ordered data.

XPath [4] has emerged as the standard for navigating XML documents. XPath utilizes path expressions to locate nodes in an XML tree, and has the following syntax:

PathExpr ::= /Step1/Step2/.../Stepn

Step ::= Axis :: NodeTest Predicate∗

Given an XPath query, the Axis in the Step establishes the set of XML nodes that are reachable via this axis, where NodeTest examines the node name and a set of Predicates can be imposed on the nodes. For example, the XPath query "/descendant::Play/child::Act" retrieves all the acts of a play. For simplicity, this query can be alternatively expressed as "//Play/Act" where "//" and "/" denote descendant and child axes respectively.

The preceding and following axes in XPath describe the sequence of nodes before/after the context node (excluding any ancestors/descendants). For example, the query "//Storm/following::Tornado" requires that the element Tornado must occur after the element Storm. XPath also provides the preceding-sibling and following-sibling axes to select all preceding or following sibling nodes.

Existing labeling schemes such as the interval-based [9, 17] and prime number [15] schemes preserve the order information of XML data. The work in [14] examines how ordered XML can be stored and queried using a relational database system. However, a key issue that has been neglected is how the selectivity of XML queries with order-based axes can be estimated.

The selectivity estimation of XML queries with order axes is a challenging task given the huge volume of the order information that needs to be summarized. Existing XML selectivity estimators [5, 6, 10, 11, 12, 13, 16] are designed specifically for XML queries without order axes, and the order information is typically not captured.

Contributions. This paper describes a framework for estimating the selectivity of XPath expressions with order-based axes. To the best of our knowledge, this is the first work to address the problem of summarizing order information in XML data. The key contributions are:

1) We use a path encoding scheme to aggregate the path and order information of XML data. This scheme associates each node in an XML tree with a path id that indicates the type of path where the node occurs. Based on the path ids, the frequencies of element tags and sibling node sequences can be collected.

2) We design two compact structures, p-histogram and o-histogram, to summarize the path and order information of XML data respectively. In order to reduce the effect of data skewness in the buckets, we use the intra-bucket frequency variance to control the histogram construction.

3) We develop effective methods to estimate the selectivity of XPath expressions. We first remove the irrelevant path ids associated with elements involved in a query. Then the frequency values of the remaining path ids are utilized to calculate the selectivity.

4) We carry out an extensive experimental study of the proposed approach. The results show that the proposed solution yields very low estimation error rates for XPath queries even with limited memory space.

The rest of this paper is organized as follows. Section 2 gives the background. Section 3 describes the path and order information captured. The estimation methods are presented in Sections 4 and 5. We describe the data structures in Section 6. Section 7 gives the experimental results. Finally, we discuss related work in Section 8 and conclude in Section 9.

2 Preliminaries

A path encoding scheme that labels XML nodes for efficient structural joins is designed in [8]. In this section, we review this labeling scheme and the notion of path id containment, which is the basis of the proposed estimation system.

Path Encoding Scheme. The path encoding scheme [8] uses an integer to encode each distinct root-to-leaf path in an XML document and stores them in an encoding table. Figure 1(a) shows an XML document with four distinct root-to-leaf paths. Each distinct path is assigned an integer and inserted into the encoding table shown in Figure 1(b).

The labeling scheme associates each element node in an XML document with a path id that comprises a sequence of bits. The number of bits is the number of distinct root-to-leaf paths, i.e., the number of tuples in the encoding table. Path ids are assigned to nodes as follows:

1) For a leaf node, the path id is given by setting the i-th bit (from the left) to 1, where i denotes the encoding of the root-to-leaf path on which the leaf node occurs.

2) For a non-leaf node, the path id is given by a bit-or operation on the path ids of all its child nodes.

Example 2.1: In Figure 1(a), the path id of the first leaf node D is p5 (1000) since the encoding of the path "Root/A/B/D" on which D occurs is 1. The path id of the first C node, p3 (0011), is obtained by a bit-or operation on the path ids of its child nodes E and F, whose path ids are p2 (0010) and p1 (0001) respectively. All bit sequences are collected in a path id table (Figure 1(c)). □
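The two labeling rules above can be sketched in a few lines of Python. This is only an illustration: the encoding values follow the description of Figure 1(b), while the function name `assign_path_ids`, the `(tag, children)` tuple representation, and the small two-branch document are assumptions introduced here (the document is not the exact tree of Figure 1(a)).

```python
# Encoding table as described for Figure 1(b): root-to-leaf path -> integer.
ENCODING = {
    "Root/A/B/D": 1,
    "Root/A/B/E": 2,
    "Root/A/C/E": 3,
    "Root/A/C/F": 4,
}
NBITS = len(ENCODING)  # one bit per distinct root-to-leaf path


def assign_path_ids(node, prefix=""):
    """Label a (tag, children) tree; returns (tag, path_id, labeled_children)."""
    tag, children = node
    path = f"{prefix}/{tag}" if prefix else tag
    if not children:
        # Leaf: set the i-th bit from the left, where i encodes its path.
        return (tag, 1 << (NBITS - ENCODING[path]), [])
    labeled = [assign_path_ids(child, path) for child in children]
    pid = 0
    for _, child_pid, _ in labeled:
        pid |= child_pid  # Non-leaf: bit-or of the children's path ids.
    return (tag, pid, labeled)


# Hypothetical document used only for illustration.
doc = ("Root", [
    ("A", [("B", [("D", []), ("E", [])])]),
    ("A", [("C", [("E", []), ("F", [])]), ("B", [("D", [])])]),
])

root = assign_path_ids(doc)
first_a, second_a = root[2]
for label, pid in [("Root", root[1]), ("first A", first_a[1]),
                   ("second A", second_a[1]), ("C", second_a[2][0][1])]:
    print(label, format(pid, "04b"))
```

Storing path ids as integers keeps the bit-or and the later bit-and containment test to single machine operations when the number of distinct paths is small.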

Path Id Containment. The path ids can be utilized to detect parent-child or ancestor-descendant relationships between two sets of nodes. Let $S_X$ be a set of element nodes labeled with X that have the same path id $Pid_X$. Let $S_Y$ be another set of element nodes with a similar definition. We will discuss how to determine the relationship between these two sets of nodes.

(a) XML Instance: a tree rooted at Root(p9) whose element nodes are A(p6), A(p7), A(p8), B(p5) ×3, B(p8), C(p2), C(p3), D(p5) ×4, E(p2) ×2, E(p4) and F(p1).

(b) Encoding Table

Root-to-leaf path    Encoding
Root/A/B/D           1
Root/A/B/E           2
Root/A/C/E           3
Root/A/C/F           4

(c) Path Id Table

Path id    Bit-Seq
p1         0001
p2         0010
p3         0011
p4         0100
p5         1000
p6         1010
p7         1011
p8         1100
p9         1111

Figure 1. Path Encoding Scheme

Case 1: $Pid_X = Pid_Y$. There exists at least one ancestor-descendant relationship between a node $x \in S_X$ and a node $y \in S_Y$. A path id Pid can be decomposed into a set of root-to-leaf paths, each of which corresponds to one bit with value 1 in Pid. Thus, the relationship of x and y can be determined by the relationship of their tags, X and Y, in any one of the root-to-leaf paths of $Pid_X$ (or $Pid_Y$). Given a root-to-leaf path and two element tags, we can check the relationship between the tags from the encoding table.

Example 2.2: Consider the element tags A and B with the same path id p8 (1100) in Figure 1. We obtain the root-to-leaf paths with encoding values 1 and 2 from the path id p8 (1100) since the bits in positions 1 and 2 are 1. After checking the root-to-leaf path with encoding value 1 in the encoding table, we know that tag A is the ancestor of tag B. Hence, each node A with path id p8 is an ancestor of B with path id p8. Further, element A is the parent of B. □

Case 2: $Pid_X \neq Pid_Y$. We say $Pid_X$ contains $Pid_Y$ if $Pid_X \neq Pid_Y$ and $(Pid_X \,\&\, Pid_Y) = Pid_Y$, where & denotes the "bit-and" operation. If $Pid_X$ contains $Pid_Y$, each node x in $S_X$ must have at least one descendant node $y \in S_Y$. This is because each node x in $S_X$ must occur on the path(s) where at least one y element occurs. Further, $Pid_X$ has at least one bit with value 1 such that the corresponding bit of $Pid_Y$ is 0 due to the path id containment. In other words, there exists at least one path on which node x occurs while node y does not. Therefore, each node in $S_X$ must have at least one element y in $S_Y$ as a descendant.


Example 2.3: In Figure 1, the path id p3 (0011) for node C contains the path id p2 (0010) for node E. Each node C with p3 (0011) must be the ancestor of at least one node E with path id p2 (0010). We also know that C is the parent of E by looking up the common path of p2 and p3. □
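The containment test of Case 2 is a single bit-and comparison. The sketch below assumes path ids are held as Python integers; the names `PATH_IDS`, `contains` and `shared_paths` are hypothetical helpers, and the bit sequences are the ones quoted for Figure 1(c).

```python
# Bit sequences of the path ids discussed for Figure 1(c).
PATH_IDS = {
    "p1": 0b0001, "p2": 0b0010, "p3": 0b0011, "p4": 0b0100, "p5": 0b1000,
    "p6": 0b1010, "p7": 0b1011, "p8": 0b1100, "p9": 0b1111,
}


def contains(pid_x, pid_y):
    """Case 2: pid_x contains pid_y if they differ and pid_x & pid_y == pid_y."""
    return pid_x != pid_y and (pid_x & pid_y) == pid_y


def shared_paths(pid_x, pid_y, nbits=4):
    """Encodings (1-based, counted from the left) of the paths common to both ids."""
    common = pid_x & pid_y
    return [i + 1 for i in range(nbits) if common & (1 << (nbits - 1 - i))]


# Example 2.3: p3 (0011) contains p2 (0010); their common path has encoding 3.
print(contains(PATH_IDS["p3"], PATH_IDS["p2"]))
print(shared_paths(PATH_IDS["p3"], PATH_IDS["p2"]))
# Example 2.2 (Case 1): equal ids share every path, here encodings 1 and 2.
print(shared_paths(PATH_IDS["p8"], PATH_IDS["p8"]))
```

The encodings returned by `shared_paths` are the rows of the encoding table that would then be consulted to decide whether the two tags stand in a parent-child or a more distant ancestor-descendant relationship.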

3 Capturing Path and Order Information

The proposed estimation system captures the path and order information of XML data in the PathId-Frequency table and the Path-Order table respectively.

PathId-Frequency Table. Each tuple in the pathId-frequency table represents a distinct element tag in an XML document, and we aggregate all the path ids of each element tag together with their corresponding frequencies.

Example 3.1: Figure 2(a) shows the pathId-frequency table for the XML data in Figure 1. Since there are two C elements occurring in the XML document, one of which is associated with path id p3 and the other with p2, the entry for C in the table contains {(p2, 1), (p3, 1)}. □

Path-Order Table. The path-order table captures the sibling order information based on the path ids. Each distinct element tag is associated with a path-order table.

Given an element tag X, each column in its path-order table denotes one path id on which the elements X occur, and each row represents one element tag in the XML document. There are two regions in the path-order table, namely, the +element and element+ regions. In the +element region, a grid cell, denoted by g(pathid, tag), represents the number of elements X with pathid occurring before elements tag. In contrast, the grid cell g(pathid, tag) in the element+ region denotes the number of elements X with pathid occurring after elements tag.

(a) PathId-Frequency Table

Ele     (Path_id, Frequency)
Root    (p9,1)
A       (p6,1) (p7,1) (p8,1)
B       (p8,1) (p5,3)
C       (p2,1) (p3,1)
D       (p5,4)
E       (p4,1) (p2,2)
F       (p1,1)

(b) Path-Order Table for Element B (columns are the path ids p8 and p5; only the non-empty cells are listed)

Region    Ele    p8    p5
+Ele      C            1
Ele+      C            2

Figure 2. Path and Order Information

Example 3.2: Figure 2(b) shows the path-order table for element B in Figure 1. Since one B element annotated with p5 occurs before C and two B elements with p5 occur after element C, the values in the corresponding cells are 1 and 2 respectively. All the other cells are empty. □

Note that if an element with tag X occurs both after and before elements with tag Y, then the two rows for tag Y in the path-order table of X will both count this X element.

The pathId-frequency table captures the path ids of each element tag and their frequencies. This information is utilized to estimate the selectivity of XPath expressions without order axes. In contrast, the path-order table aggregates the frequencies of sibling nodes. This information is used to estimate XPath expressions with order axes.
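As a rough illustration of how the two tables could be collected, the sketch below counts (tag, path id) pairs and sibling orderings over lists of children given in document order. Everything here is an assumption made for the example: the `SIBLING_GROUPS` layout is a hypothetical sibling ordering chosen to be consistent with the frequencies in Examples 3.1 and 3.2 (the exact order in Figure 1(a) is not reproduced), and siblings carrying the same tag as X are skipped, which is one reading of "all the other cells are empty".

```python
from collections import Counter, defaultdict

# Children of each parent, in document order (hypothetical layout).
SIBLING_GROUPS = [
    [("A", "p8"), ("A", "p7"), ("A", "p6")],   # children of Root
    [("B", "p8")],                             # children of A(p8)
    [("D", "p5"), ("E", "p4")],                # children of B(p8)
    [("B", "p5"), ("C", "p3"), ("B", "p5")],   # children of A(p7)
    [("D", "p5")], [("E", "p2"), ("F", "p1")], [("D", "p5")],
    [("C", "p2"), ("B", "p5")],                # children of A(p6)
    [("E", "p2")], [("D", "p5")],
]


def pathid_frequency(groups, root=("Root", "p9")):
    """tag -> Counter of path ids, i.e. the pathId-frequency table."""
    table = defaultdict(Counter)
    table[root[0]][root[1]] += 1
    for group in groups:
        for tag, pid in group:
            table[tag][pid] += 1
    return table


def path_order(groups, x):
    """Return (+element, element+) counters keyed by (path id, sibling tag)."""
    before, after = Counter(), Counter()
    for group in groups:
        for i, (tag, pid) in enumerate(group):
            if tag != x:
                continue
            later = {t for t, _ in group[i + 1:]} - {x}
            earlier = {t for t, _ in group[:i]} - {x}
            for t in later:
                before[(pid, t)] += 1   # this X occurs before a sibling t
            for t in earlier:
                after[(pid, t)] += 1    # this X occurs after a sibling t
    return before, after


freq = pathid_frequency(SIBLING_GROUPS)
plus_ele, ele_plus = path_order(SIBLING_GROUPS, "B")
print(dict(freq["B"]), dict(plus_ele), dict(ele_plus))
```

An X element with siblings of tag Y on both sides contributes to both regions, matching the note about elements that occur both before and after Y.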

4 Estimating Queries with No Order Axes

In this section, we describe the estimation of simple and branch queries without order axes, which is based on a path id join algorithm [8].

Path Join. Given an XPath query Q, the path join retrieves a set of path ids and the corresponding frequencies for each element tag in Q from the pathId-frequency table. For each pair of adjacent element tags in Q, we use a nested loop to determine the containment of the path ids in their sets. Path ids that clearly do not contribute to the query result will be removed. The frequency values of the remaining path ids will be utilized to estimate the query size.

Query Q1 = //A[/C/F]/B/D drawn as a tree: A has children C and B; C has child F; B has child D.

(a) Q1, before path id join: A {(p6,1) (p7,1) (p8,1)}; B {(p8,1) (p5,3)}; C {(p2,1) (p3,1)}; D {(p5,4)}; F {(p1,1)}

(b) Q1, after path id join: A {(p7,1)}; B {(p5,3)}; C {(p3,1)}; D {(p5,4)}; F {(p1,1)}

Figure 3. Path Ids Test

Example 4.1: Let us issue the query Q1 = //A[/C/F]/B/D in Figure 3(a) on the XML data in Figure 1. The path id join algorithm evaluates query Q1 by removing irrelevant path ids from the nodes in the query. The final result is shown in Figure 3(b). We observe that path id p2 for C is removed from the path id list because p2 cannot contain the path id p1 for F. Further, path ids p6 and p8 for A are removed since they cannot contain the path id p3 for the child node C, etc. □
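The pruning in Example 4.1 can be sketched as a fixpoint over the query edges. This is a simplified illustration and not the algorithm of [8]: it assumes each tag appears once in the query, treats equal path ids as compatible without consulting the encoding table (the Case 1 tag check is skipped), and the names `covers`, `path_join` and `remaining_frequency` are hypothetical.

```python
PID = {"p1": 0b0001, "p2": 0b0010, "p3": 0b0011, "p4": 0b0100, "p5": 0b1000,
       "p6": 0b1010, "p7": 0b1011, "p8": 0b1100, "p9": 0b1111}

# PathId-frequency table from Figure 2(a).
FREQ = {
    "A": {"p6": 1, "p7": 1, "p8": 1}, "B": {"p8": 1, "p5": 3},
    "C": {"p2": 1, "p3": 1}, "D": {"p5": 4},
    "E": {"p4": 1, "p2": 2}, "F": {"p1": 1},
}


def covers(px, py):
    """True if px equals or contains py (simplification: no tag-level check)."""
    return (PID[px] & PID[py]) == PID[py]


def path_join(edges):
    """edges: (ancestor tag, descendant tag) pairs of the query tree."""
    cand = {tag: set(FREQ[tag]) for edge in edges for tag in edge}
    changed = True
    while changed:  # repeat until no path id can be pruned any more
        changed = False
        for up, down in edges:
            keep_up = {p for p in cand[up]
                       if any(covers(p, c) for c in cand[down])}
            keep_down = {c for c in cand[down]
                         if any(covers(p, c) for p in keep_up)}
            if keep_up != cand[up] or keep_down != cand[down]:
                cand[up], cand[down] = keep_up, keep_down
                changed = True
    return cand


def remaining_frequency(cand, tag):
    """Sum of the frequencies of the path ids left for `tag` after the join."""
    return sum(FREQ[tag][p] for p in cand[tag])


q1 = path_join([("A", "C"), ("C", "F"), ("A", "B"), ("B", "D")])
print({tag: sorted(pids) for tag, pids in q1.items()})
```

On Q1 this leaves p7 for A, p3 for C, p1 for F and p5 for both B and D, which agrees with Figure 3(b).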

Given an XPath query Q with target node n (the node whose selectivity is to be estimated), we denote the selectivity for n as $S_Q(n)$ and the sum of the frequencies of the remaining path ids of n after the path join as $f_Q(n)$.

Estimating Simple Queries. A simple XPath query Q has the basic form $/n_1/n_2/\ldots/n_i$ where / and $n_i$ denote the child axis and the label of a node respectively. The following theorem computes the selectivity of simple XPath queries.

Theorem 4.1 After carrying out a path id join on a simple query Q, the summarized frequency value $f_Q(n)$ of node n is the same as the selectivity $S_Q(n)$, that is,

$$S_Q(n) = f_Q(n) \qquad (1)$$

Proof: After a path id join, only the path ids that satisfy the path id containment relationship are associated with the nodes involved in the query. Thus, $f_Q(n) = S_Q(n)$. □

Example 4.2: Consider the query "//A//C" issued on the XML instance in Figure 1. After a path id join, the lists of path ids that are associated with A and C are {p6, p7} and {p2, p3} respectively. From the corresponding frequencies, we know that the selectivity of both A and C is 2. □

Estimating Branch Queries. We define a branch query pattern Q as $/n_1/\ldots/n_i[/n_{i1}/\ldots/n_{il}]/n_{i+1}\ldots/n_m$. The path $/n_1/\ldots/n_i$ is the trunk part, while the paths $/n_{i1}/\ldots/n_{il}$ and $/n_{i+1}\ldots/n_m$ are the branch parts of the query. XPath provides different formats, such as $q_1[/q_2]/q_3$ or $q_1[/q_2][/q_3]$, to specify the position of the target node whose selectivity is to be estimated. Here, we standardize the branch query pattern as $q_1[/q_2]/q_3$, where each $q_i$ is a simple query, and explicitly specify the target node.

Given a branch query Q, if the target node n occurs in the trunk part $q_1$ of the query, then according to Theorem 4.1, we have $S_Q(n) = f_Q(n)$. However, if n occurs on the branch part $q_2$ or $q_3$ of Q, then $f_Q(n)$ may over-estimate $S_Q(n)$. This is because path ids are designed to directly capture the parent-child and ancestor-descendant containment relationships, but not the relationship between sibling nodes.

(a) Q2 = //C[/E]/F, after path id join: C {(p3,1)}; E {(p2,2)}; F {(p1,1)}

(b) Q′2 = //C/E, after path id join: C {(p2,1) (p3,1)}; E {(p2,2)}

Figure 4. Example of a Branch Query

Example 4.3: Figure 4(a) shows a branch query Q2 = //C[/E]/F issued on the XML instance in Figure 1. The target node E is circled. The path join associates E with a path id set {(p2, 2)}. Figure 1 shows that only one E element with p2 is the answer. The other E element is not in the result because the path id p2 of its parent C has been removed during the path id containment test between C and F. Note that the estimation for C is the correct answer. □

To compensate for this over-estimation, we devise a method which utilizes the correct selectivity information of other nodes to determine the selectivity of a target node occurring in the branch parts. This method is based on the following assumption.

Node Independence Assumption: Given a branch query $Q = q_1[/q_2]/q_3$ with target node n in the branch part $q_2$, the distribution of node n on $q_2$ in the XML data is independent of the distribution of all other nodes in the other branch path, i.e., $q_3$, on which n does not occur.

Example 4.4: Suppose we issue two queries Q1 = //A[/B]/C and Q2 = //A/B on an XML document. Based on the Node Independence Assumption, we have $S_{Q_1}(B)/S_{Q_1}(A) \approx S_{Q_2}(B)/S_{Q_2}(A)$, since the distribution of B under A is independent of the distribution of node C, which occurs on the other branch. □

Suppose the target node n belongs to the branch $q_2$ of Q. To estimate $S_Q(n)$, we generate a simple query $Q' = q_1/q_2$ from Q by ignoring the branch $q_3$. The result set of $n_i$ (the last node of $q_1$) on query Q is a subset of that of $n_i$ on $Q'$. The result sets of the nodes occurring on $q_2$ of Q are reduced accordingly. Based on the Node Independence Assumption, we infer that $S_Q(n)/S_Q(n_i) \approx S_{Q'}(n)/S_{Q'}(n_i)$. Since $Q'$ is a simple query, we have the correct selectivity values of n and $n_i$ in $Q'$ based on Theorem 4.1, i.e., $S_{Q'}(n) = f_{Q'}(n)$ and $S_{Q'}(n_i) = f_{Q'}(n_i)$. In addition, $S_Q(n_i) = f_Q(n_i)$ since $n_i$ is in the trunk part of Q. Thus, we have the following formula:

$$S_Q(n) \approx f_{Q'}(n) \cdot f_Q(n_i) / f_{Q'}(n_i) \qquad (2)$$

Example 4.5: Consider query Q2 in Figure 4(a), where node C is the last element node in the trunk part and E is the target node. We generate a new query $Q'_2$ by cutting off the branch path where the target node does not occur (see Figure 4(b)). After a path id join on both queries, the values of $f_{Q_2}(C)$, $f_{Q'_2}(C)$ and $f_{Q'_2}(E)$ are 1, 2 and 2 respectively. Hence, we estimate $S_{Q_2}(E)$ as $f_{Q'_2}(E) \cdot f_{Q_2}(C) / f_{Q'_2}(C) = 2 \cdot 1 / 2 = 1$.
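The arithmetic of Equation (2) is small enough to state as a helper. The function name and its zero-denominator guard are assumptions of this sketch; the inputs are the three frequency sums obtained from the path id joins on Q and on the simplified query Q′.

```python
def estimate_branch_target(f_qprime_n, f_q_ni, f_qprime_ni):
    """Equation (2): S_Q(n) ~ f_Q'(n) * f_Q(n_i) / f_Q'(n_i).

    f_qprime_n  -- frequency sum of the target n after the join on Q'
    f_q_ni      -- frequency sum of the last trunk node n_i after the join on Q
    f_qprime_ni -- frequency sum of n_i after the join on Q'
    """
    if f_qprime_ni == 0:
        return 0.0  # assumption: no n_i survives in Q', so nothing can match
    return f_qprime_n * f_q_ni / f_qprime_ni


# Example 4.5: f_Q'2(E) = 2, f_Q2(C) = 1, f_Q'2(C) = 2.
print(estimate_branch_target(2, 1, 2))
```

The ratio f_Q(n_i)/f_Q′(n_i) acts as a scaling factor: it is the fraction of trunk nodes that survive once the ignored branch is taken into account.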

5 Estimating Queries with Order Axes

Next, we present the techniques to estimate queries with preceding-sibling and following-sibling axes, and extend the proposed methods to estimate queries with preceding and following axes.

Preceding-Sibling/Following-Sibling Axis. An XPath query with order axes can be denoted as $\vec{Q} = q_1[/q_2/folls::q_3]$ (or $q_1[/q_2/pres::q_3]$). $\vec{Q}$ requires that both branches $q_2$ and $q_3$ occur under $q_1$, and that the entire path expression $q_2$ occurs before (after) $q_3$, where folls (pres) represents the following-sibling (preceding-sibling) axis. We denote the counterpart query of $\vec{Q}$ without order axes as $Q = q_1[/q_2]/q_3$, which is generated from $\vec{Q}$ by removing the order axes.

Given an XPath query $\vec{Q} = q_1[/q_2/folls::q_3]$, we use the path-order table to compute the selectivity of the sibling nodes in $\vec{Q}$, i.e., the first nodes of $q_2$ and $q_3$. The results of these sibling nodes are utilized to compute the selectivity of the other nodes in the query. The method to determine the selectivity of the target node depends on whether the target node occurs in the branch part or the trunk part of the query.

Case 1: Target Node in Branch Part.

We first examine the situation where the target node on the branch part is also a sibling node.

Consider the query $\vec{Q} = q_1[/q_2/folls::q_3]$ with target node $n_{i+1}$ (the first node of $q_3$). A path join is first carried out on $\vec{Q}$ (or $Q = q_1[/q_2]/q_3$). Next, if we directly use the remaining path ids of $n_{i+1}$ to retrieve frequency values from its path-order table, we may over-estimate the selectivity. This is because in the path-order table for $n_{i+1}$, there is no path id condition imposed on the element $n_{i1}$ (the first node of $q_2$, the sibling node of $n_{i+1}$). However, in query $\vec{Q}$, we require that $n_{i1}$ must occur in the query pattern $q_1/q_2$. To overcome this problem, we make the following assumption.

Node Order Uniformity Assumption: Given m elements X that are sibling nodes of Y, of which $m_s$ occur before (or after) Y, these $m_s$ X elements are uniformly distributed among all m X elements. That is, if we randomly select $m'$ of the m X elements, there will be $m'_s$ X elements which occur before (or after) Y, such that $m'/m \approx m'_s/m_s$.

We generate a simplified query $\vec{Q}' = q_1[/n_{i1}/folls::q_3]$ from $\vec{Q}$ by deleting the branch part $q_2$ except for its first node $n_{i1}$. Then we compute the selectivity $S_{\vec{Q}'}(n_{i+1})$ and the selectivity of the nodes in its counterpart $Q' = q_1[/n_{i1}]/q_3$ without order axes.

For query $Q' = q_1[/n_{i1}]/q_3$, we have $S_{Q'}(n_{i+1})$ (i.e., m in the Node Order Uniformity Assumption) elements tagged with $n_{i+1}$ that satisfy $Q'$ and are siblings of $n_{i1}$, of which $S_{\vec{Q}'}(n_{i+1})$ (i.e., $m_s$) elements tagged with $n_{i+1}$ occur after $n_{i1}$. If we select $S_Q(n_{i+1})$ (i.e., $m'$) elements from the results of $n_{i+1}$ on $Q'$, we will have $S_Q(n_{i+1})/S_{Q'}(n_{i+1}) \approx S_{\vec{Q}}(n_{i+1})/S_{\vec{Q}'}(n_{i+1})$, where $S_{\vec{Q}}(n_{i+1})$ is $m'_s$. Thus, we have

$$S_{\vec{Q}}(n_{i+1}) \approx S_{\vec{Q}'}(n_{i+1}) \cdot S_Q(n_{i+1}) / S_{Q'}(n_{i+1}) \qquad (3)$$

The selectivity values $S_Q(n_{i+1})$ and $S_{Q'}(n_{i+1})$ can be estimated by using the estimation method for branch queries. The correct $S_{\vec{Q}'}(n_{i+1})$ can be retrieved from the path-order table for $n_{i+1}$ as follows. After the path join on $Q'$, for each remaining pid associated with $n_{i+1}$, we retrieve $g(pid, n_{i1})$, i.e., the number of $n_{i+1}$ elements with path id pid that occur after $n_{i1}$, from the path-order table for $n_{i+1}$. The sum of all such $g(pid, n_{i1})$ is the selectivity of $n_{i+1}$ for $\vec{Q}'$ according to the path-order table.

The value $S_{\vec{Q}'}(n_{i+1})$ obtained is the correct selectivity of $n_{i+1}$ in $\vec{Q}'$. This is because after the path id join, the path ids associated with $n_{i+1}$ represent the correct result set for $n_{i+1}$ in the simple query $q_1/q_3$. Further, the correct frequency values for the $n_{i+1}$ elements with these path ids which occur after $n_{i1}$ are recorded in the path-order table for $n_{i+1}$.

(a) $\vec{Q}_1$, after path id join: A {(p7,1)}; C {(p3,1)}; F {(p1,1)}; B {(p5,3)}; D {(p5,4)}

(b) $\vec{Q}'_1$, after path id join: A {(p6,1) (p7,1)}; C {(p2,1) (p3,1)}; B {(p5,3)}; D {(p5,4)}

Figure 5. XPath Query with Order Axes

Example 5.1: Figure 5(a) shows a query $\vec{Q}_1 = A[/C[/F]/folls::B/D]$ with target node B. The simplified query $\vec{Q}'_1 = A[/C/folls::B/D]$ is shown in Figure 5(b). The nodes in Figure 5 are annotated with the remaining path ids after a path id join. The value of $S_{\vec{Q}'_1}(B)$, which is 2, is retrieved from the path-order table for B (see Figure 2(b)) with element tag C and path id p5, which is the remaining path id in Figure 5(b). The values of $S_{Q_1}(B)$ and $S_{Q'_1}(B)$ are estimated as 1.3 and 2.6 respectively by using the estimation method without order axes. Finally, $S_{\vec{Q}_1}(B) = S_{\vec{Q}'_1}(B) \cdot S_{Q_1}(B) / S_{Q'_1}(B) = 2 \cdot 1.3 / 2.6 = 1$. □
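The calculation in Example 5.1 combines a path-order table lookup with Equation (3). The sketch below assumes the path-order table for B is stored as two dictionaries keyed by (path id, sibling tag), with `"after"` standing for the element+ region; this layout and the function names are hypothetical.

```python
# Path-order table for B from Figure 2(b): region -> {(path id, tag): count}.
PATH_ORDER_B = {
    "before": {("p5", "C"): 1},   # +element region: B occurs before C
    "after": {("p5", "C"): 2},    # element+ region: B occurs after C
}


def ordered_sibling_selectivity(table, region, remaining_pids, sibling_tag):
    """Sum g(pid, sibling_tag) over the path ids left after the path id join."""
    cells = table[region]
    return sum(cells.get((pid, sibling_tag), 0) for pid in remaining_pids)


def estimate_ordered_sibling(s_ord_simplified, s_q, s_q_simplified):
    """Equation (3): scale the table value by S_Q(n) / S_Q'(n)."""
    if s_q_simplified == 0:
        return 0.0  # assumption: no match in Q' implies no match in the ordered query
    return s_ord_simplified * s_q / s_q_simplified


# Example 5.1: folls::B means B must occur after C, so the element+ region is used.
s_ord = ordered_sibling_selectivity(PATH_ORDER_B, "after", ["p5"], "C")
print(s_ord, estimate_ordered_sibling(s_ord, 1.3, 2.6))
```

For a preceding-sibling axis the same lookup would read the `"before"` region instead.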

Next, we consider the query $\vec{Q}$ where the target node n occurs in the branch part but is not the sibling node $n_{i1}$ or $n_{i+1}$. Suppose that n occurs in $q_3$. To utilize the selectivity of $n_{i+1}$, we make the following assumption.

Node Containment Uniformity Assumption: Given $m_x$ ancestors X and $m_y$ descendants Y in an XML dataset, we assume that all the elements Y are uniformly distributed under all their ancestors X. That is, if we randomly select $m'_x$ out of the $m_x$ elements X, these X elements will contain $m'_y$ Y descendants, such that $m'_x/m_x \approx m'_y/m_y$.

The above assumption is applicable when $n_{i+1}$ and the target node n correspond to the X and Y in the assumption respectively. We have $S_{\vec{Q}}(n_{i+1})/S_Q(n_{i+1}) \approx S_{\vec{Q}}(n)/S_Q(n)$. Since $S_{\vec{Q}}(n_{i+1})/S_Q(n_{i+1}) \approx S_{\vec{Q}'}(n_{i+1})/S_{Q'}(n_{i+1})$ (Equation (3)), we have

$$S_{\vec{Q}}(n) \approx S_Q(n) \cdot S_{\vec{Q}'}(n_{i+1}) / S_{Q'}(n_{i+1}) \qquad (4)$$

Example 5.2: Consider the query $\vec{Q}_1$ in Figure 5(a). Let the target node be D. $S_{\vec{Q}_1}(D)$ is estimated as $S_{Q_1}(D) \cdot S_{\vec{Q}'_1}(B) / S_{Q'_1}(B) = 1.3 \cdot 2 / 2.6 = 1$, where the value of $S_{\vec{Q}'_1}(B)$ is retrieved from the path-order table for B, and $S_{Q_1}(D)$ and $S_{Q'_1}(B)$ are estimated as 1.3 and 2.6 respectively. □

Case 2: Target Node in Trunk Part.

When the target node n occurs in the trunk part $q_1$ of $\vec{Q}$, it is obvious that the selectivity $S_{\vec{Q}}(n)$ cannot be larger than $S_Q(n)$, which is therefore an upper bound of $S_{\vec{Q}}(n)$. We can further optimize the estimation with order information.

We observe that when the order axes of $\vec{Q}$ are imposed on the query Q, some elements $n_{i1}$ (the first node of $q_2$) that do not satisfy the order axes will be eliminated from the query result sets. According to the Node Containment Uniformity Assumption, these eliminated $n_{i1}$ elements are uniformly distributed under all elements n that satisfy the query Q (without order axes), and so are the remaining elements $n_{i1}$. Thus, we can deduce that the elimination of elements $n_{i1}$ as a result of imposing the order axes will not affect the selectivity $S_Q(n)$ if the number of remaining elements $n_{i1}$, i.e., $S_{\vec{Q}}(n_{i1})$, is greater than or equal to $S_Q(n)$.

When $S_{\vec{Q}}(n_{i1}) < S_Q(n)$, each element n in $\vec{Q}$ will have at most one descendant $n_{i1}$, and the value of $S_{\vec{Q}}(n)$ is estimated as $S_{\vec{Q}}(n_{i1})$. Similarly, we can optimize the upper bound estimation with $S_{\vec{Q}}(n_{i+1})$. Therefore, given a query $\vec{Q}$ where the target node n occurs in the trunk part, we have

$$S_{\vec{Q}}(n) \approx \min(S_Q(n),\ S_{\vec{Q}}(n_{i1}),\ S_{\vec{Q}}(n_{i+1})) \qquad (5)$$
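Equations (4) and (5) reduce to a ratio and a minimum once the component selectivities are available. The two helpers below are a minimal sketch with hypothetical names; their inputs are assumed to come from the branch-query estimator and the path-order lookup described earlier.

```python
def estimate_branch_descendant(s_q_n, s_ord_simplified_sibling, s_q_simplified_sibling):
    """Equation (4): target n below the sibling node n_{i+1} in a branch."""
    if s_q_simplified_sibling == 0:
        return 0.0  # assumption: no sibling match in Q' means no match at all
    return s_q_n * s_ord_simplified_sibling / s_q_simplified_sibling


def estimate_trunk_target(s_q_n, s_ord_first_sibling, s_ord_second_sibling):
    """Equation (5): the unordered selectivity is capped by both ordered siblings."""
    return min(s_q_n, s_ord_first_sibling, s_ord_second_sibling)


# Example 5.2: S_Q1(D) = 1.3, ordered S for B in the simplified query = 2,
# unordered S for B in the simplified query = 2.6.
print(estimate_branch_descendant(1.3, 2, 2.6))
# Hypothetical trunk case: 3 unordered matches, but only 2 ordered siblings remain.
print(estimate_trunk_target(3, 2, 5))
```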

Preceding/Following Axis. The techniques to estimate queries with preceding-sibling (following-sibling) axes can be extended to process queries with preceding (following) axes. We can convert a query with preceding (following) axes into a set of XPath expressions involving only preceding-sibling (following-sibling) axes according to the path ids of the nodes associated with the preceding (following) axes after the path id join. Then the estimation result is given by the sum of the selectivities of the set of path expressions.

Example 5.3: Suppose we issue the query //A[/C/foll::D] (foll denotes the following axis) with target node D on the XML data in Figure 1. The path id join will associate nodes A, C and D with the path id sets {p6, p7}, {p2, p3} and {p5} respectively. We check the path id p5 of node D. Since only the first bit of p5 (1000) is 1, the path between A and D must be A/B/D. Hence, the query can be converted to the query with following-sibling axis //A[/C/folls::B/D]. □
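The conversion in Example 5.3 amounts to decoding the target's path id into root-to-leaf paths and reading off the steps between the trunk node and the target. The sketch below assumes path ids are given as bit strings and uses the encoding values described for Figure 1(b); the helper names and the string form of the rewritten query are hypothetical.

```python
# Encoding table as described for Figure 1(b): integer -> root-to-leaf path.
ENCODING = {1: "Root/A/B/D", 2: "Root/A/B/E", 3: "Root/A/C/E", 4: "Root/A/C/F"}


def paths_of(bits):
    """Root-to-leaf paths whose bit (counted from the left) is set in the path id."""
    return [ENCODING[i] for i, bit in enumerate(bits, start=1) if bit == "1"]


def steps_below(bits, anchor, target):
    """Distinct step sequences from a child of `anchor` down to `target`."""
    found = set()
    for path in paths_of(bits):
        tags = path.split("/")
        if anchor in tags:
            a = tags.index(anchor)
            if target in tags[a + 1:]:
                t = tags.index(target, a + 1)
                found.add("/".join(tags[a + 1:t + 1]))
    return sorted(found)


def to_sibling_queries(anchor, branch, target, target_bits):
    """Rewrite //anchor[/branch/foll::target] into following-sibling queries."""
    steps = {s for bits in target_bits for s in steps_below(bits, anchor, target)}
    return [f"//{anchor}[/{branch}/folls::{s}]" for s in sorted(steps)]


# Example 5.3: D keeps only p5 (1000) after the path id join.
print(to_sibling_queries("A", "C", "D", ["1000"]))
```

When several distinct step sequences are produced, the estimate for the original query would be the sum of the estimates of the rewritten queries, as stated above.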

6 Data Structures

The pathId-frequency table and path-order table are implemented by the p-histogram and o-histogram respectively. Since both histograms are constructed from the path ids of the element nodes in an XML document, we design a binary tree to index the path ids.

Path ID Binary Tree. We use a binary tree to index the path ids. The structure of the binary tree is defined as follows:

1) The left and right edges in the binary tree represent the bits 0 and 1 respectively.

2) Each leaf node represents a path id, which is specified by the id (integer) associated with the node.

3) The bit sequence of the path id at each leaf node can be obtained by concatenating the bits of all edges from the root node to this leaf node.

4) The id attached to an internal node is the largest path id in its left subtree. If the left subtree of an internal node is empty, this node is attached with an integer that is less than the least value in its right subtree.

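[Figure: binary tree over the path ids of Figure 1(c). Leaves carry the path ids 1 to 9, internal nodes carry the routing ids defined above, and the dotted edges mark the subtrees that can be removed by compression.]

Figure 6. Path Id Binary Tree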

Example 6.1: Figure 6 shows the binary tree of the path ids in Figure 1(c). The leftmost internal node is assigned the value 0 while the least path id value in its right subtree is 1. The leaf node with id 2 denotes the path id 0010, which is obtained by concatenating the bits of all edges in the path from the root to the leaf. □

To retrieve the bit sequence of a given path id, we navigate down the binary tree and compare the given path id with the ids of the internal nodes. We visit the left child if the given value is not greater than the node id; otherwise, we visit the right child. After reaching the leaf node, the concatenation of the bits of all edges traversed is the bit sequence of the given path id.
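The construction rules and the lookup just described can be sketched as an uncompressed trie. This is an illustration under assumptions: path ids are numbered in increasing order of their bit sequences (as in Figure 1(c)), an internal node with an empty left subtree receives the least id of its right subtree minus one, and the names `Node`, `build` and `bit_sequence` are introduced here. Compression of all-0 or all-1 subtrees is not implemented.

```python
class Node:
    def __init__(self):
        self.left = None    # edge labeled 0
        self.right = None   # edge labeled 1
        self.id = None      # path id (leaf) or routing id (internal node)


def build(path_ids):
    """path_ids maps integer id -> bit string; ids must follow bit-string order."""
    root = Node()
    for pid, bits in path_ids.items():
        node = root
        for bit in bits:
            side = "right" if bit == "1" else "left"
            if getattr(node, side) is None:
                setattr(node, side, Node())
            node = getattr(node, side)
        node.id = pid
    _annotate(root)
    return root


def _annotate(node):
    """Set routing ids bottom-up; returns (least id, largest id) of the subtree."""
    if node.left is None and node.right is None:
        return node.id, node.id
    lo = hi = None
    if node.left is not None:
        lo, hi = _annotate(node.left)
        node.id = hi                    # largest path id in the left subtree
    if node.right is not None:
        r_lo, r_hi = _annotate(node.right)
        if node.left is None:
            node.id = r_lo - 1          # less than every id in the right subtree
            lo = r_lo
        hi = r_hi
    return lo, hi


def bit_sequence(root, pid):
    """Walk down comparing pid with routing ids; returns None if pid is absent."""
    node, bits = root, ""
    while node.left is not None or node.right is not None:
        go_left = pid <= node.id
        child = node.left if go_left else node.right
        if child is None:
            return None
        bits += "0" if go_left else "1"
        node = child
    return bits if node.id == pid else None


TABLE = {1: "0001", 2: "0010", 3: "0011", 4: "0100", 5: "1000",
         6: "1010", 7: "1011", 8: "1100", 9: "1111"}
tree = build(TABLE)
print(bit_sequence(tree, 2), bit_sequence(tree, 7))
```

With this numbering, the leftmost internal node (reached by three 0 edges) receives the routing id 0, as in Example 6.1.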

The binary tree can be compressed without loss of information. If a left (right) subtree of an internal node only contains left (right) edges, we remove the subtree and its incoming edge. This is because this left (right) subtree only represents a subsequence containing all 0 (1) bits. For example, in Figure 6, the dotted edges and the associated nodes can be safely removed.

P-Histogram. We build a p-histogram for each distinct element tag to summarize the pathId-frequency information. Each bucket in a p-histogram contains a set of path ids and their average frequency value. To reduce the data skewness inside a bucket, we require that the frequency variance of each bucket is not larger than a given variance threshold v. Given a set of pathid-frequency pairs $(p_i, f_i)$ for an element tag, the frequency variance $v_b$ of a bucket is defined as:

$$v_b = \sqrt{\frac{(f_1 - avgf)^2 + \ldots + (f_k - avgf)^2}{k}}$$

where $f_i$ denotes the frequency of a path id $p_i$, k is the number of path ids in the bucket, and $avgf = \sum f_i / k$.

Example 6.2: Figure 7 shows two p-histograms built on the same pathid-frequency list with different given variance values, 0 and 1. The variance with value 0 indicates that the intra-bucket frequency must represent the correct frequency values of the corresponding path ids. □

Path_id-Frequency list: (p2, 2) (p3, 2) (p1, 5) (p5, 7)

P-Histogram1 (v = 0)

Path ids    fre    v
p2, p3      2      0
p1          5      0
p5          7      0

P-Histogram2 (v = 1)

Path ids    fre    v
p2, p3      2      0
p1, p5      6      1

Figure 7. P-Histogram

Algorithm 1 gives an efficient heuristic algorithm to build the p-histogram. The algorithm takes as input the pathId-frequency list for an element e and a variance threshold v. This pathId-frequency list is first sorted according to the frequency values. Next, we scan the list to find the longest sublist such that its frequency variance is not greater than the given threshold v. The data in the longest sublist detected is used to build a bucket. This detect-and-build procedure is repeated until we finish scanning the pathId-frequency list.

Algorithm 1 P-Histogram Construction
Input: PathId-fre list for element e and variance threshold v
Output: P-histogram for element e

1) Sort the pathId-fre list according to the frequency values.
2) Create a bucket b. Scan and add the path ids into b until the intra-bucket variance is larger than the given v.
3) Repeat step 2 until the pathId-fre list has been fully scanned.
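Algorithm 1 can be sketched as a greedy scan over the sorted list. The sketch assumes the bucket variance is the square root of the mean squared deviation, a reading that reproduces the per-bucket values shown in Figure 7; the function names and the (path ids, average frequency) bucket representation are hypothetical.

```python
import math


def bucket_variance(freqs):
    """Square root of the mean squared deviation from the bucket average."""
    avg = sum(freqs) / len(freqs)
    return math.sqrt(sum((f - avg) ** 2 for f in freqs) / len(freqs))


def _close(items):
    """Turn a list of (path id, frequency) pairs into (path ids, average)."""
    return ([pid for pid, _ in items], sum(f for _, f in items) / len(items))


def build_p_histogram(pairs, v):
    """Greedy detect-and-build: grow a bucket until its variance would exceed v."""
    buckets, current = [], []
    for pid, freq in sorted(pairs, key=lambda pf: pf[1]):   # step 1: sort
        candidate = current + [(pid, freq)]
        if current and bucket_variance([f for _, f in candidate]) > v:
            buckets.append(_close(current))                  # step 2: close bucket
            candidate = [(pid, freq)]
        current = candidate
    if current:
        buckets.append(_close(current))                      # last open bucket
    return buckets


PAIRS = [("p2", 2), ("p3", 2), ("p1", 5), ("p5", 7)]   # list from Figure 7
print(build_p_histogram(PAIRS, 0))
print(build_p_histogram(PAIRS, 1))
```

With threshold 0 every bucket stores exact frequencies, while threshold 1 merges p1 and p5 into one bucket with average frequency 6, as in Figure 7.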

O-Histogram. The o-histogram summarizes the path-order information. The path-order table is very sparse since the frequencies in the majority of the cells are 0. Hence, we only need to store the cells with non-zero values.

A bucket in the o-histogram has the format (x.start, y.start, x.end, y.end, frequency), where the first four variables of the bucket describe a bounding box in the path-order table and the variable frequency denotes the average frequency value of all cells in this box. Similar to the p-histogram, the o-histogram also uses the frequency variance to reduce intra-bucket data skewness.

Example 6.3: Figure 8 shows an o-histogram that is built on the given path-order table with variance 1. This o-histogram has four buckets and they cover all the non-empty cells in the table. □
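To make the bucket format concrete, the sketch below looks up the estimated frequency of a path-order cell from a list of bounding boxes. The four buckets are the ones listed in Figure 8; treating x as the path id index and y as the element tag index, and returning 0 for cells outside every bucket, are assumptions of this illustration, as are the names `Bucket` and `estimate_cell`.

```python
from typing import NamedTuple


class Bucket(NamedTuple):
    x_start: int
    y_start: int
    x_end: int
    y_end: int
    frequency: float   # average frequency of the cells in the box


# Buckets listed in Figure 8 (variance threshold 1).
BUCKETS = [
    Bucket(3, 2, 4, 3, 1),
    Bucket(2, 3, 2, 3, 1),
    Bucket(4, 4, 5, 4, 8),
    Bucket(6, 8, 6, 8, 5),
]


def estimate_cell(buckets, x, y):
    """Average frequency of the bucket covering cell (x, y); 0 if none does."""
    for b in buckets:
        if b.x_start <= x <= b.x_end and b.y_start <= y <= b.y_end:
            return b.frequency
    return 0


print(estimate_cell(BUCKETS, 4, 3), estimate_cell(BUCKETS, 5, 4),
      estimate_cell(BUCKETS, 1, 1))
```

A linear scan is adequate for a handful of buckets; a larger histogram would presumably call for a spatial index, which the text does not discuss.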

Algorithm 2 shows the details of constructing the o-histogram. First, we sort the rows of the given path-order table according to the alphabetic order of the element tags, and its columns according to the path id order generated in the p-histogram. The sorted element tags and path ids are encoded using integers. The purpose of this step is that, given a bounding box, we can find the corresponding element tags and path ids.

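[Figure: path-order table drawn as a grid with Path Id on the horizontal axis and Ele on the vertical axis, overlaid with the bounding boxes of the buckets below.]

O-Histogram (variance = 1)

bucket1 (3, 2, 4, 3, 1)
bucket2 (2, 3, 2, 3, 1)
bucket3 (4, 4, 5, 4, 8)
bucket4 (6, 8, 6, 8, 5)

Figure 8. O-Histogram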

Next, all non-empty cells in the path-order table are scanned row-wise. For each non-empty cell, we extend it to a possible maximal box such that the frequency variance in the box is not greater than the given variance threshold.

Algorithm 2 O-Histogram Construction
Input: Path-order table for element e and variance threshold v
Output: O-histogram for element e

1) Sort path-order table.

• Element tags: alphabetic order.
• Path ids: path id order in the p-histogram for e.

2) Scan non-empty cells to get a bounding box (bucket).

• Extend the current cell to a row of cells.
• Extend this row to a box of cells.

3) Repeat step 2 until all non-empty cells have been scanned.

The detection of the maximal box is performed in two steps. First, we extend the current cell to a row of cells. This extension stops if we encounter an empty cell, or if the next cell is inside some other bucket that has already been built. Second, this row of cells is extended to a box by adding the rows above this row to the bucket until an empty row (all cells are empty) is reached. In each step of the extension, we must guarantee that the variance in the box (or in just one row) is not larger than the given variance value.

7 Experiments

We carry out experiments to evaluate the performance of the proposed techniques in terms of memory space requirement, summary construction time and estimation accuracy. The techniques are implemented in C++ and the experiments are carried out on a Pentium IV 2.4 GHz CPU with 1 GB RAM. The operating system is Windows XP.

Datasets. We use both real-world and synthetic datasets. Table 1 shows the characteristics of the datasets: Shakespeare's Plays (SSPlays) [1], DBLP [2] and XMark [3].


Dataset    Size       #(Distinct Eles)   #(Eles)
SSPlays    7.5 MB     21                 179,690
DBLP       65.2 MB    31                 1,711,542
XMark      20.4 MB    74                 319,815

Table 1. Characteristics of Datasets

Query Workload. We first generate 4000 simple queries and 4000 branch queries without order axes for each dataset. The simple queries are generated by randomly selecting the subsequences of the root-to-leaf paths from the encoding table. The branch queries are produced by merging any two of these subsequences if they have common nodes. The query sizes (number of nodes) vary from 3 to 12. Duplicate queries and negative queries are removed to obtain a reasonable average relative error.

Datasets   Without Order                With Order
           Simple    Branch    Total
SSPlays    188       2328      2516     1168
DBLP       202       1013      1215     646
XMark      1358      2686      4044     1654

Table 2. Query Workload

Next, we generate queries with order axes by fixing the order between the sibling nodes of the generated branch queries and then eliminating negative ones from them. Table 2 shows the number of queries obtained.
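The role of removing negative queries can be seen in a small sketch of the error metric. This section does not define the metric, so the code assumes the common definition |estimated - actual| / actual averaged over the workload, and it assumes that a negative query is one whose actual result size is 0, for which the ratio is undefined. The function name `average_relative_error` is hypothetical.

```python
from typing import Iterable, Tuple


def average_relative_error(results: Iterable[Tuple[float, float]]) -> float:
    """Average of |estimated - actual| / actual over (estimated, actual) pairs.

    Assumption: pairs with actual == 0 (negative queries) are skipped,
    mirroring their removal from the workload.
    """
    errors = [abs(est - act) / act for est, act in results if act > 0]
    if not errors:
        raise ValueError("no query with a non-empty result")
    return sum(errors) / len(errors)
```

With estimates of 90 and 12 for true sizes of 100 and 10, the individual errors are 0.1 and 0.2, giving an average of about 0.15.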

7.1 Memory Space Requirement

We first evaluate the space requirement of the encoding table and path id binary tree. Table 3 shows that the sizes of the encoding table are very small for all datasets. We also show the size of the path id table. The real-world datasets require very limited space due to their regular structures. The binary tree is able to save about 78% of the space requirement for the XMark dataset. This is because XMark has a large number of distinct paths, leading to long path ids. Many large subtrees with only left (or right) edges can be trimmed.
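The savings figure can be checked directly against the PidTab and Pid Bin-Tree columns of Table 3. The following lines are only a worked calculation; the variable names are illustrative and the sizes are copied from the table.

```python
# (path id table size in KB, path id binary tree size in KB) from Table 3.
TABLE_3_SIZES_KB = {
    "SSPlays": (0.92, 0.93),
    "DBLP": (3.60, 2.97),
    "XMark": (299.7, 67.3),
}

# Fraction of the path id table size saved by the binary tree.
savings = {
    name: 1 - tree_kb / table_kb
    for name, (table_kb, tree_kb) in TABLE_3_SIZES_KB.items()
}

for name, saved in savings.items():
    print(f"{name}: {saved:.1%}")
```

The XMark value comes out at roughly 77.5%, in line with the reported figure of about 78%, while the saving for DBLP is about 17.5% and SSPlays shows essentially no saving.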

Figure 9 shows the memory usage of the p-histogram at varying intra-bucket variance values. We see that all the datasets have similar curves. The p-histogram memory usage decreases when the variance value is varied from 0 to 4. The dataset XMark needs more space compared to the other datasets since it has more element tags and distinct path ids.

Figure 9 also shows that the o-histogram memory usage for all three datasets decreases as the o-histogram intra-bucket variance grows. Both the p-histogram and o-histogram require nearly the same memory space for the SSPlays and XMark datasets, while we see a sharp increase of memory usage from p-histogram to o-histogram for the DBLP dataset. This is because the data distribution in DBLP is shallower and wider than in the other datasets. As a result, the large number of sibling nodes in DBLP generates more order information to be stored.

Dataset    #(Dist Paths)   Pid Size (Byte)   #(Dist Pid)
SSPlays    40              5                 115
DBLP       87              11                327
XMark      344             43                6811

Dataset    EncTab (KB)   PidTab (KB)   Pid Bin-Tree (KB)
SSPlays    0.24          0.92          0.93
DBLP       0.39          3.60          2.97
XMark      2.90          299.7         67.3

Table 3. Space Requirement of Encoding Table and Path Id Binary Tree

7.2 Summary Construction Time

We also evaluate the cost of building the summary data for selectivity estimation. Table 4 shows the construction time of the proposed solution and XSketch [12] for queries without order axes. The XMark dataset requires the longest time to summarize the path information since it has the largest number of distinct root-to-leaf paths. We observe that the p-histogram construction time is almost negligible for all three datasets because our algorithm scans the path-frequency information only once.

Proposed Path-Based Solution
Dataset    Collecting Path Time   P-Histo Size       P-Histo Construction Time
SSPlays    1.6 s                  0.55 ∼ 0.75 KB     <0.001 s
DBLP       78.4 s                 1.4 ∼ 2.1 KB       <0.001 s
XMark      246.2 s                20.4 ∼ 24.6 KB     <0.001 s

XSketch
Dataset    Collecting Data Time   Statistics Size    Statistics Construction Time
SSPlays    32.3 s                 1.6 ∼ 2 KB         2 ∼ 3 s
DBLP       390.7 s                4.8 ∼ 5.8 KB       19 ∼ 30 s
XMark      197.7 s                90 ∼ 95 KB         > 1 week

Table 4. Time for Queries without Order Axes

Dataset    Collecting Order Time   O-Histo Size      O-Histo Construction Time
SSPlays    2.2 s                   1.2 ∼ 1.8 KB      0.002 ∼ 0.003 s
DBLP       4574.8 s                7.4 ∼ 12.7 KB     0.02 ∼ 0.03 s
XMark      2347.2 s                11 ∼ 21.3 KB      1.2 ∼ 2.1 s

Table 5. Construction Time for Order Data

In contrast, [12] utilizes a greedy refinement strategy to incrementally add complexity on the existing summary information. Hence, the construction time will grow quickly when the statistics size increases. In Table 4, the worst case happens for the XMark dataset with statistics size of 90-95 KB. Note that we ensure the summary size of XSketch is approximately the same as the total memory size of the encoding table, path id binary tree and p-histogram.

[Figure 9 plots: panels (a) SSPlays, (b) DBLP, (c) XMark. x-axis: Intra-Bucket Variance (0 to 14); y-axis: Memory Usage (KB); series: O-Histo, P-Histo.]

Figure 9. P-Histogram and O-Histogram Memory Usage

[Figure 10 plots: panels (a) SSPlays, (b) DBLP, (c) XMark. x-axis: P-Histogram Memory Usage (KB); y-axis: Relative Error; series: simple queries, branch queries, all queries.]

Figure 10. Estimation Error of Queries without Order Axes

[Figure 11 plots: panels (a) SSPlays, (b) DBLP, (c) XMark. x-axis: Total Memory Usage (KB); y-axis: Relative Error; series: p-histo, xsketch.]

Figure 11. P-Histogram Vs XSketch

[Figure 12 plots: panels (a) SSPlays, (b) DBLP, (c) XMark. x-axis: O-Histogram Memory Usage (KB); y-axis: Relative Error; series: p-histo.v=0, p-histo.v=1, p-histo.v=5, p-histo.v=10.]

Figure 12. Estimation Error of Queries with Order Axes (Branch Part)

[Figure 13 plots: panels (a) SSPlays, (b) DBLP, (c) XMark. y-axis: Relative Error; the x-axis tick ranges match those of Figure 12; only the legend entry p-histo.v=10 is recoverable.]

Figure 13. Estimation Error of Queries with Order Axes (Trunk Part)

Table 5 shows the construction time for order data. Compared to Table 4, more time is needed for all three datasets due to the huge amount of order information that needs to be captured. The o-histogram building algorithm is efficient because of the single-scan construction method.

7.3 Estimation Accuracy

We evaluate the accuracy of the proposed solution on XPath queries, and compare our approach with XSketch [12].

Queries without Order Axes. Figure 10 shows the estimation accuracy for queries without order axes. The memory usage on the x-axis corresponds to the p-histogram memory usage on the y-axis in Figure 9. The last data points, which correspond to p-histogram.variance=0, show that the correct selectivity is obtained for simple queries. We observe that the error of the branch queries is very low (< 7% for all datasets) when the p-histogram variance is zero. The simple queries have a lower estimation error compared to the branch queries. The larger estimation error of the branch queries arises from the estimated data (bucket frequency). Further, the branch queries do not satisfy the Node Independence Assumption.

We also compare the proposed estimation method with XSketch [12] for queries without order axes. Figure 11 shows the total memory usage of our approach (encoding table, path id binary tree and p-histogram). The curves for our method are shorter than those for XSketch. This is because the maximal memory usage occurs when the p-histogram variance is 0 for our method. We observe that if sufficient memory space is available, our method outperforms XSketch. XSketch shows more accurate results with low memory usage. This is because our solution requires a minimum space to store the encoding table and path id binary tree, and additional memory space (p-histogram) will increase the estimation accuracy, leading to a significant decrease of the estimation error.

Queries with Order Axes. Next, we examine the estimation accuracy for queries with order axes. Figure 12 shows the average relative error (target nodes in branch parts) when the memory usage of o-histograms varies. We set the variance of p-histogram at 0, 1, 5 and 10, and plot four curves in each graph. We see that when the exact frequency values are stored in p-histogram (p-histogram.variance=0), the relative errors for the three datasets are smaller than 10% at o-histogram variance 2 (the o-histogram memory usages are 1.4 KB, 9.8 KB and 14.8 KB respectively for the three datasets), and the error rates can be further reduced to less than 6% when the o-histogram variances are 0 (the memory usages are 1.8 KB, 12.7 KB and 21.3 KB respectively).

The accuracy curves for the SSPlays and XMark datasets are relatively flat at high p-histogram variance. This indicates that if the pathId-frequency information is not accurate (high p-histogram variance values), we cannot improve the estimation accuracy by setting smaller o-histogram variance (for more accurate order information). The curves for the DBLP dataset are very flat, indicating that this dataset is not sensitive to the o-histogram variance in all values of p-histogram variance. This is because more memory space is required to store the order information for the DBLP dataset compared to path information.

Figure 13 shows the estimation accuracy results when the target nodes occur in the trunk parts of the queries. We observe that the estimation is reasonably accurate at low p-histogram variances even if we set a high o-histogram variance (low o-histogram memory usage). Compared with Figure 12, we can achieve lower estimation error for the SSPlays and XMark datasets at low p-histogram variance values and low o-histogram memory usage (see Figure 13). This is because we use Equation 5 to estimate the selectivity, which is the smallest value among the results of two order-based queries and one non-order-based query. With a low p-histogram variance value, we can obtain accurate results for queries without order axes, which compensates for the loss of detailed order information.
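The effect described above can be sketched as a simple minimum over three estimates. Equation 5 itself is not reproduced in this section, so the code reflects only the textual description (the smallest of two order-based results and one non-order-based result); the function and parameter names are hypothetical.

```python
def estimate_trunk_target(
    order_estimate_a: float,
    order_estimate_b: float,
    no_order_estimate: float,
) -> float:
    """Combine three selectivity estimates by taking the smallest one.

    Assumption: each argument is an estimated result size for the same
    target node, from two order-based queries and one non-order-based query.
    """
    return min(order_estimate_a, order_estimate_b, no_order_estimate)
```

Because the result is capped by the non-order-based estimate, an accurate p-histogram limits the damage from coarse order information: with order-based estimates of 12.0 and 20.0 and a non-order-based estimate of 9.5, the combined estimate is 9.5.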

Overall, the experimental results demonstrate the effectiveness of the proposed techniques, which yield low estimation errors while requiring a very limited amount of memory. We also show that the proposed techniques typically perform well when the p-histogram variance is set at 0-2 and the o-histogram variance is set at 0-4.

8 Related Work

Existing research on XML selectivity estimation has focused on queries without order axes [5, 6, 10, 11, 12, 13, 16]. The methods proposed in [5, 10, 11] are based on Markov models. [11] stores the frequencies of all paths with length up to k, which are aggregated to estimate the node frequency of longer paths. [5] proposes the path tree, which is structurally similar to DataGuides [7]. Low-frequency nodes are pruned in path trees. XPathLearner [10] utilizes query feedback to collect the statistical information. These Markov-based solutions are limited to simple path queries.

XSketch [12] extends XML tree models in [5] to graphs, and considers both simple paths and branch queries. [13] extends XSketch to support queries with value predicates. The work in [6] estimates twig queries by building a suffix tree for all the root-to-leaf paths. Every node in the tree is associated with a hash signature which denotes the set of nodes on the path rooted at this node. The hash signature is used to calculate the frequency of twig queries which are merged from multiple simple paths.

[16] presents a position histogram approach. A two-dimensional position histogram is built on either the element tag or element content of each element. A position histogram join is carried out to estimate the query result sizes based on the node interval containment relationship. Since only containment information between nodes is captured, this approach cannot distinguish between parent-child and ancestor-descendant relationships.

9 Conclusion

To the best of our knowledge, this is the first work that provides a uniform framework to estimate the selectivity of XPath expressions both with and without order axes. We capture the path information where the element occurs and utilize a join-based method to estimate queries without order axes. The order information of each element tag is summarized by using a path-order table. We design two histograms, namely, p-histogram and o-histogram, to summarize the path and order information respectively. Extensive experimental evaluation on both real-world and synthetic datasets clearly demonstrates the effectiveness of the proposed approach.

References

[1] http://www.ibiblio.org/xml/examples/shakespeare.

[2] http://www.informatik.uni-trier.de/~ley/db/.

[3] http://monetdb.cwi.nl/xml/downloads.html.

[4] XML Path Language. http://www.w3.org/TR/xpath.

[5] A. Aboulnaga, A. R. Alameldeen, and J. F. Naughton. Estimating the Selectivity of XML Path Expressions for Internet Scale Applications. In 27th International Conference on Very Large Data Bases, 2001.

[6] Z. Chen, H. V. Jagadish, F. Korn, and N. Koudas. Counting Twig Matches in a Tree. In 17th IEEE International Conference on Data Engineering, 2001.

[7] R. Goldman and J. Widom. DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. In 23rd International Conference on Very Large Data Bases, 1997.

[8] H. Li, M. Lee, and W. Hsu. A Path-Based Labeling Scheme for Efficient Structural Join. In Third International XML Database Symposium, 2005.

[9] Q. Li and B. Moon. Indexing and Querying XML Data for Regular Path Expressions. In 27th International Conference on Very Large Data Bases, 2001.

[10] L. Lim, M. Wang, S. Padmanabhan, J. S. Vitter, and R. Parr. XPathLearner: An On-Line Self-Tuning Markov Histogram for XML Path Selectivity Estimation. In 28th International Conference on Very Large Data Bases, 2002.

[11] J. McHugh and J. Widom. Query Optimization for XML. In 25th International Conference on Very Large Data Bases, 1999.

[12] N. Polyzotis and M. Garofalakis. Statistical Synopses for Graph-Structured XML Database. In ACM SIGMOD, 2002.

[13] N. Polyzotis and M. N. Garofalakis. Structure and Value Synopses for XML Data Graphs. In 28th International Conference on Very Large Data Bases, 2002.

[14] I. Tatarinov, S. Viglas, K. S. Beyer, J. Shanmugasundaram, E. J. Shekita, and C. Zhang. Storing and Querying Ordered XML Using a Relational Database System. In ACM SIGMOD, 2002.

[15] X. Wu, M. Lee, and W. Hsu. A Prime Number Labeling Scheme for Dynamic Ordered XML Trees. In 20th IEEE International Conference on Data Engineering, 2004.

[16] Y. Wu, J. M. Patel, and H. V. Jagadish. Estimating Answer Sizes for XML Queries. In 8th International Conference on Extending Database Technology, 2002.

[17] C. Zhang, J. F. Naughton, D. J. DeWitt, Q. Luo, and G. M. Lohman. On Supporting Containment Queries in Relational Database Management Systems. In ACM SIGMOD, 2001.

