

Processing and Evaluating Partial Tree Pattern Queries on XML Data

Xiaoying Wu, Stefanos Souldatos, Dimitri Theodoratos, Theodore Dalamagas, Yannis Vassiliou, and Timos Sellis, Fellow, IEEE

Abstract: XML query languages typically allow the specification of structural patterns using XPath. Usually, these structural patterns are in the form of trees (Tree-Pattern Queries, TPQs). Finding the occurrences of such patterns in an XML tree is a key operation in XML query evaluation. The multiple previous algorithms presented for this operation focus mainly on the evaluation of tree-pattern queries. Recently, requirements for flexible querying of XML data have motivated the consideration of query classes that are more expressive and flexible than TPQs for which efficient nonmain-memory evaluation algorithms are not known. In this paper, we consider a class of queries, called Partial Tree-Pattern Queries (PTPQs), which generalize and strictly contain TPQs. PTPQs represent a broad fragment of XPath which is very useful in practice. In order to process PTPQs, we introduce a set of sound and complete inference rules to characterize structural relationship derivation. We provide necessary and sufficient conditions for detecting query unsatisfiability and node redundancy. We also show that PTPQs can be represented as directed acyclic graphs augmented with the "same-path" constraints. In order to leverage existing efficient evaluation algorithms for less expressive classes of queries, we design two approaches that evaluate a PTPQ by decomposing it into a set of simpler queries: algorithm IndexTPQGen exploits a structural summary of the XML data and evaluates a PTPQ by generating an equivalent set of TPQs and unioning their answers. Algorithm PartialPathJoin decomposes the PTPQ into partial-path queries, and merge-joins their solutions. We also develop PartialTreeStack, an original polynomial time holistic algorithm for PTPQs. To the best of our knowledge, this is the first algorithm to support the evaluation of such a broad structural fragment of XPath in the inverted lists evaluation model. We provide a theoretical analysis of our algorithm and identify cases where it is asymptotically optimal. An extensive experimental evaluation shows that it is more efficient, robust, and stable than the other two and it outperforms a state-of-the-art XQuery engine on PTPQs.

Index Terms: XML query processing, XPath query evaluation, tree-pattern query, partial tree-pattern query

1 INTRODUCTION

Query languages for XML data typically allow the specification of structural patterns of elements. In practice, these structural patterns are specified using XPath [1], a language that lies at the core of the standard XML query language XQuery [1]. Usually, the structural patterns are in the form of trees (Tree-Pattern Queries, TPQs). A restrictive characteristic of TPQs is that they impose a total order for the nodes in every path of the query pattern. However, recent applications of XML require querying of data whose structure is complex [52] or is not fully known to the user [32], [43], [45], or integrating XML data sources with different structures [24], [32], [43]. In order to satisfy these requirements, different approaches are adopted that range from using unstructured keyword queries [24] to extending XQuery with keyword search capabilities [4], [32]. TPQs are not expressive enough to specify these new types of queries. Larger subclasses of XPath are required for which, up to now, efficient nonmain-memory evaluation algorithms are not known.

Suppose, for instance, that a user wants to find information about the title, year of publication, and genre of the books written by the author named "John Smith." This information has to be extracted from multiple XML data sources that export bibliography data on the web and categorize, as is the case in practice, book information differently. Some of the data sources categorize their books by genre and then by year and author, others categorize them by year and author and then by genre, etc. The different categorizations correspond to a multitude of ways for structuring the exported XML trees. Since the three elements year, author, and genre are part of the categorization hierarchy, they occur on the same path of the XML tree no matter what categorization is chosen. The user knows also that the author name element occurs always below the author element and that the same holds for the element that contains information about the book title. An analogous situation appears when the user wants to extract the book information from a single XML data source but is not aware, or has partial knowledge, of the structure of the data source. A query for such a request can be expressed in XQuery and one such formulation is shown in Fig. 1.

2244 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 12, DECEMBER 2012

• X. Wu is with Wuhan University, State Key Laboratory of Software Engineering, China. E-mail: [email protected].

• D. Theodoratos is with the Computer Science Department, New Jersey Institute of Technology, University Heights, Newark, NJ 07102. E-mail: [email protected].

• S. Souldatos and Y. Vassiliou are with the Division of Computer Science, School of Electrical and Computer Engineering, National Technical University of Athens, Iroon Polytechniou 9, Politechnioupoli Zographou, Athens 157 80, Greece. E-mail: [email protected], [email protected].

• T. Dalamagas and T. Sellis are with the Institute for the Management of Information Systems (IMIS), Research Center "Athena," Bakou 17, Athens 11524, Greece. E-mail: {dalamag, timos}@imis.athena-innovation.gr.

Manuscript received 1 Aug. 2009; revised 20 May 2010; accepted 16 Apr. 2011; published online 20 June 2011. Recommended for acceptance by J. Yang. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDE-2009-08-0582. Digital Object Identifier no. 10.1109/TKDE.2011.137.

1041-4347/12/$31.00 © 2012 IEEE Published by the IEEE Computer Society

Fig. 2 shows a graphical representation of this query. This representation is useful because it reveals a restriction which is required for specifying such a query and is called here the same-path constraint. This constraint restricts a number of query nodes which are not necessarily involved in a structural relationship to map always to nodes on the same XML tree path. This is the case of nodes year, author, and genre in Fig. 2 which are surrounded by a dotted line labeled the same-path. Because of the same-path constraint, such queries cannot be expressed as tree-pattern queries or even as graph queries.

In this paper, we consider a query language for XML, called Partial Tree-Pattern Query (PTPQ) language, which is able to express such queries. PTPQs generalize and strictly contain TPQs. They are flexible enough to allow a large range of queries from keyword-style queries with no structure, to keyword queries with arbitrary structural constraints, to fully specified TPQs. PTPQs are not restricted by a total order for the nodes in a path of the query pattern since they can constrain a number of (possibly unrelated) nodes to lie on the same path (same-path constraint). Overall, PTPQs represent a broad fragment of XPath which is very useful in practice.

A broad fragment of XPath such as PTPQs can be useful only if it is complemented with efficient evaluation techniques. A growing number of XML applications, in particular data-centric applications, handle documents too large to be processed in memory. A recent approach for the nonmain-memory evaluation of queries on XML data assumes that the data are preprocessed and an inverted list of the regional encoding [2], [9], [27] of the nodes is built for every node label of the XML document tree. We refer to this evaluation model as the inverted lists model. The advantage of the inverted lists evaluation is that it can process large XML documents without preloading them in memory (nonmain-memory evaluation). Unfortunately, existing nonmain-memory evaluation algorithms focus almost exclusively on TPQs.

Problem addressed. In this paper, we undertake the task of designing an efficient evaluation algorithm for PTPQs in the inverted lists model. This task is complex: as we show later, due to their expressive power, PTPQs can only be represented as directed acyclic graphs (dags) annotated with the same-path constraints. Matching these query patterns to XML trees requires the appropriate handling of both the structural constraints of the dag and the same-path constraints. These two types of constraints can be conflicting: a matching that satisfies the structural constraints of the dag may violate the same-path constraints, and vice versa.

One might wonder whether existing techniques can be used for efficiently evaluating PTPQs. In fact, as we show later in the paper, a PTPQ is equivalent to a set of TPQs for which efficient algorithms exist. Unfortunately, this transformation leads to a number of TPQs which, in the worst case, is exponential in the size of the PTPQ. Our experimental results show that another technique that decomposes the PTPQ dag into simpler query patterns, which can be evaluated efficiently, also fails to produce satisfactory performance.

Contribution. The main contributions are the following:

• Because the structure of a tree may not be fully specified in a PTPQ, new structural expressions can be derived from those explicitly specified in the query. These structural expressions are important in query processing. We define a sound and complete set of inference rules to characterize structural relationship derivation in PTPQs (Section 3.1).

• Unlike tree-pattern queries, PTPQs can be unsatisfiable. We provide necessary and sufficient conditions for detecting query unsatisfiability (Section 3.2).

• PTPQs may contain redundant nodes, i.e., nodes that always have the same matching to the XML data as other nodes of the query. Ignoring redundant nodes accelerates query evaluation. We provide conditions for efficiently identifying redundant nodes (Appendix, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.137).

• We use a formalism to represent PTPQs as directed acyclic graphs (dags) annotated with the same-path constraints (Section 2.2). We show that PTPQs can express a broad fragment of XPath which comprises reverse axes and the node identity equality (is-same-node) operator in addition to forward axes and predicates (Section 2.3).

• In order to leverage existing state-of-the-art evaluation techniques, we design two approaches that decompose the PTPQ into a set of simpler queries for which efficient algorithms exist: algorithm IndexTPQGen exploits a structural summary of the XML data to generate a set of TPQs equivalent to the given PTPQ, and computes the answer of the PTPQ by taking the union of their answers (Section 4.1). Algorithm PartialPathJoin decomposes the PTPQ into partial-path queries (PPQ) [42], and computes the answer of the PTPQ by merge-joining their solutions (Section 4.2).

• We develop an efficient holistic evaluation algorithm for PTPQs called PartialTreeStack. PartialTreeStack takes into account the annotated dag form of PTPQs and avoids checking whether node matches satisfy the same-path constraint when it can derive that they violate the dag structural constraints (Section 4.1).

• We provide a theoretical analysis of PartialTreeStack to show its polynomial time and space complexity. We further show that under the reasonable assumption that the size of queries is not significant compared to the size of data, PartialTreeStack is asymptotically optimal for PTPQs without parent-child structural relationships (Section 4.2).

• We implemented all three algorithms and conducted detailed experiments to compare their performance. The experimental results show that PartialTreeStack outperforms the other two algorithms (Section 7.1) on nontrivial PTPQs and is more robust and stable. We also experimentally show that PartialTreeStack outperforms a state-of-the-art XQuery engine on PTPQs even though no indexing techniques are employed (Section 7.2).

• To the best of our knowledge, PartialTreeStack is the first algorithm in the inverted lists model that supports such a broad fragment of XPath.

Fig. 1. XQuery expression.

Fig. 2. A graphical representation of the bibliography query.

2 PARTIAL TREE PATTERN QUERY LANGUAGE

In this section, we discuss the XML data model and the partial tree pattern query language.

2.1 Data Model

XML data are commonly modeled by a tree structure. Tree nodes are labeled and represent elements, attributes, or values. Tree edges represent element-subelement, element-attribute, and element-value relationships. Let $L$ be the set of node labels. Without loss of generality, we assume that only the root node of every XML tree is labeled by $r \in L$. We denote XML tree labels by lower case letters. To distinguish between nodes with the same label, node labels in the XML tree may have a subscript. Fig. 3 shows an XML tree. The triplets next to the nodes encode their position in the tree, and they are explained below.

Positional representation. For XML trees, we adopt the positional representation widely used for XML query processing [53], [2], [9], [27]. The positional representation associates with every node a triplet (start, end, level) of values. The start and end values of a node are integers which can be determined through a depth-first traversal of the XML tree, by sequentially assigning numbers to the first and the last visit of the node. The level value represents the level of the node in the XML tree.
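The numbering described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the `Node` class and the `encode` function are hypothetical names, and the sketch assumes that the counter starts at 1 and that the root is at level 1 (the text does not fix either convention).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)
    start: int = 0
    end: int = 0
    level: int = 0


def encode(root):
    """Assign (start, end, level) to every node in one depth-first traversal."""
    counter = 0

    def visit(node, level):
        nonlocal counter
        counter += 1
        node.start = counter      # number given at the first visit
        node.level = level
        for child in node.children:
            visit(child, level + 1)
        counter += 1
        node.end = counter        # number given at the last visit

    visit(root, 1)                # assumption: the root is at level 1
    return root
```

For a tree with root r, children a and c, and a single child b under a, the sketch assigns r the triplet (1, 8, 1), a the triplet (2, 5, 2), b the triplet (3, 4, 3), and c the triplet (6, 7, 2).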

The positional representation allows efficiently checking structural relationships between two nodes in the XML tree. For instance, given two nodes $n_1$ and $n_2$, $n_1$ is an ancestor of $n_2$ iff $n_1.start < n_2.start$ and $n_2.end < n_1.end$. Node $n_1$ is the parent of $n_2$ iff $n_1.start < n_2.start$, $n_2.end < n_1.end$, and $n_1.level = n_2.level - 1$.
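The two conditions translate directly into constant-time predicates. The following Python sketch is illustrative only; the `Pos` tuple and the function names are hypothetical and do not come from the paper.

```python
from typing import NamedTuple


class Pos(NamedTuple):
    start: int
    end: int
    level: int


def is_ancestor(n1, n2):
    """n1 is an ancestor of n2: the interval of n1 strictly encloses that of n2."""
    return n1.start < n2.start and n2.end < n1.end


def is_parent(n1, n2):
    """n1 is the parent of n2: an ancestor exactly one level above."""
    return is_ancestor(n1, n2) and n1.level == n2.level - 1
```

With r = Pos(1, 8, 1), a = Pos(2, 5, 2), and b = Pos(3, 4, 3), r is an ancestor but not the parent of b, while a is the parent of b.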

In this paper, we often need to check whether a number of nodes in an XML tree lie on the same path. This check can be performed efficiently using the following proposition.

Proposition 2.1. Given a set of nodes $n_1, \ldots, n_k$ in an XML tree $T$, let maxStart and minEnd denote, respectively, the maximum start and the minimum end values in the positional representations of $n_1, \ldots, n_k$. Nodes $n_1, \ldots, n_k$ lie on the same path in $T$ iff $maxStart \leq minEnd$.
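Proposition 2.1 reduces the same-path check to one pass over the nodes. A minimal Python sketch follows; `Pos` and `on_same_path` are hypothetical names introduced only for this illustration.

```python
from typing import NamedTuple


class Pos(NamedTuple):
    start: int
    end: int
    level: int


def on_same_path(nodes):
    """Proposition 2.1: the nodes lie on one path iff maxStart <= minEnd."""
    max_start = max(n.start for n in nodes)
    min_end = min(n.end for n in nodes)
    return max_start <= min_end
```

The check works because the (start, end) intervals of two tree nodes are either nested or disjoint, so a set of intervals with a common point must form a chain of ancestors and descendants.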

2.2 Query Language

We now introduce the syntax and semantics of our queries.

Syntax. A Partial Tree-Pattern Query specifies a pattern which partially determines a tree. PTPQs comprise nodes and child and descendant relationships between nodes. The nodes are grouped into disjoint sets called partial paths (PPs). PTPQs are embedded to XML trees. The nodes of a partial path are embedded to nodes on the same XML tree path. However, unlike paths in TPQs, the child and descendant relationships in partial paths do not necessarily form a total order. This is the reason for qualifying these paths as partial. We start by defining Partial Path Queries, which are PTPQs with a single partial path.

Definition 2.1 (PPQ). Let $N$ be an infinite set of labeled nodes. Nodes in $N$ are labeled by labels in $L$. Let $X$ and $Y$ denote distinct nodes in $N$. A partial path query is a finite set of expressions of the form $X/Y$ (child relationship) or $X//Y$ (descendant relationship). Child and descendant relationships are collectively called precedence relationships.

Fig. 4 shows five example PPQs. Note that the labels of the query nodes are denoted by capital letters to distinguish them from the labels of the XML tree nodes. In this sense, label l in an XML tree and label L in a query represent the same label. We can represent a PPQ as a node-labeled graph. Single (resp. double) arrows correspond to child (resp. descendant) relationships. Fig. 5 shows the graph representation of the queries of Fig. 4. Notice that a PPQ graph can be disconnected, e.g., query Q4 in Fig. 5d. With a PPQ, the user can flexibly specify the structure of a path in a query fully, partially, or not at all.

Fig. 3. XML tree.

Fig. 4. Partial path queries.

Fig. 5. Graph representation of PPQs.

PTPQs also comprise node sharing expressions. A node sharing expression indicates that two nodes from different partial paths are to be embedded to the same XML tree node. That is, the image of these two nodes is the same (shared) node in the XML tree. The formal definition of PTPQs follows.

Definition 2.2 (PTPQ). A partial tree-pattern query is a pair $(S, N)$ where

• $S$ is a list of $n$ named sets $p_1, \ldots, p_n$ called partial paths. Each PP $p_i$ is a finite set of precedence relationships. We write $X[p_i]/Y[p_i]$ (resp. $X[p_i]//Y[p_i]$) to indicate that $X/Y$ (resp. $X//Y$) is a relationship in PP $p_i$.

• $N$ is a set of node sharing expressions $X[p_i] \approx Y[p_j]$, where $p_i$ and $p_j$ are distinct PPs, and $X$ and $Y$ are nodes in PPs $p_i$ and $p_j$, respectively, such that both of them are labeled by the same label in $L$.

Precedence relationships and node sharing expressions are collectively called structural relationships.

Fig. 6a shows a PTPQ Q and Fig. 6b shows a visual representation for Q. This representation depicts partial paths as graphs and connects nodes from different partial paths participating in node sharing expressions through edges labeled by $\approx$. We use this representation later on in Section 4 to design an algorithm for evaluating PTPQs. Unless otherwise indicated, in the following, "query" refers to a PTPQ.

Semantics. The answer of a PTPQ on an XML tree is a set of tuples of nodes from the XML tree that satisfy the structural relationships and the same-path constraints of the PTPQ. Formally:

Definition 2.3 (Query embedding). An embedding of a query $Q$ into an XML tree $T$ is a mapping $M$ from the nodes of $Q$ to nodes of $T$ such that

1. a node $A[p_j]$ in $Q$ is mapped by $M$ to a node of $T$ labeled by $a$;

2. the nodes of $Q$ in the same PP are mapped by $M$ to nodes that lie on the same path in $T$;

3. $\forall\, X[p_i]/Y[p_i]$ (resp. $X[p_i]//Y[p_i]$) in $Q$, $M(Y[p_i])$ is a child (resp. descendant) of $M(X[p_i])$ in $T$;

4. $\forall\, X[p_i] \approx Y[p_j]$ in $Q$, $M(X[p_i])$ and $M(Y[p_j])$ coincide in $T$.

We call image of $Q$ under an embedding $M$ a tuple that contains one field per node in $Q$, and the value of the field is the image of the node under $M$. Such a tuple is also called a solution of $Q$ on $T$.

Definition 2.4 (Query answer). The answer of $Q$ on $T$ is the set of solutions of $Q$ under all possible embeddings of $Q$ to $T$.
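To make Definitions 2.3 and 2.4 concrete, the following Python sketch enumerates embeddings by brute force, testing the four conditions on every combination of label-compatible XML nodes. It is a naive illustration that takes time exponential in the query size, not one of the evaluation algorithms discussed in this paper. All names (`XNode`, `embeddings`, the dictionary-based query encoding) are hypothetical, query nodes are assumed to have unique names, and query labels are written in lower case so that they can be compared with XML labels directly.

```python
from itertools import product
from typing import NamedTuple


class XNode(NamedTuple):
    name: str    # distinguishes XML nodes with the same label, e.g. "a1"
    label: str
    start: int
    end: int
    level: int


def is_ancestor(u, v):
    return u.start < v.start and v.end < u.end


def on_same_path(nodes):
    return max(n.start for n in nodes) <= min(n.end for n in nodes)


def embeddings(tree, qnodes, child=(), desc=(), share=()):
    """Yield every embedding as a dict query_node -> XNode.

    tree:   list of XNode
    qnodes: dict query_node -> (label, partial_path)
    child, desc: (X, Y) pairs for X/Y and X//Y
    share:  (X, Y) pairs for node sharing expressions
    """
    names = list(qnodes)
    # Condition 1: only XML nodes with the right label are candidates.
    candidates = [[n for n in tree if n.label == qnodes[q][0]] for q in names]
    for combo in product(*candidates):
        m = dict(zip(names, combo))
        by_pp = {}
        for q in names:
            by_pp.setdefault(qnodes[q][1], []).append(m[q])
        if (all(on_same_path(ns) for ns in by_pp.values())           # condition 2
                and all(is_ancestor(m[x], m[y])
                        and m[x].level == m[y].level - 1
                        for x, y in child)                           # condition 3, child
                and all(is_ancestor(m[x], m[y]) for x, y in desc)    # condition 3, descendant
                and all(m[x] == m[y] for x, y in share)):            # condition 4
            yield m
```

The answer of Definition 2.4 is then the set of images of all yielded mappings. Note that two query nodes in the same partial path with no precedence relationship between them still constrain each other through condition 2.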

Consider, for instance, query Q2 of Fig. 5b and query Q of Fig. 6a. The answer of Q2 on the XML tree of Fig. 3 is $\{\langle R: r, A_1: a_1, C_1: c_2\rangle, \langle R: r, A_1: a_3, C_1: c_2\rangle, \langle R: r, A_1: a_9, C_1: c_{10}\rangle\}$, and that of Q is $\{\langle R: r, A_1: a_9, A_2: a_9, A_3: a_9, B_1: b_{12}, B_2: b_{12}, C_1: c_{10}, C_2: c_{10}, D: d_{13}, E: e_{14}, F: f_{11}\rangle\}$.

Query Q1 in Fig. 5 is a PPQ which is also a path-pattern query since the structural relationships in the query induce a total order for the query nodes. Query Q2 is syntactically similar to a tree-pattern query (twig). However, the semantics is different: when query Q2 is a PPQ, the images of the query nodes A1 and C1 should lie on the same path on the XML tree.

2.3 Generality of Partial Tree Pattern Query Language

Clearly, the class of PTPQs cannot be expressed by TPQs and this also holds even for PPQs. For instance, PPQs can constrain a number of nodes in a query pattern to belong to the same path even if there is no precedence relationship between these nodes in the PPQ. Such a query cannot be expressed by a TPQ. TPQs correspond to the fragment $XP^{\{[\,],/,//\}}$ of XPath that involves predicates ([]), and child (/) and descendant (//) axes. In fact, it is not difficult to see that PTPQs cannot be expressed either by the larger fragment $XP^{\{[\,],/,//,\backslash,\backslash\backslash\}}$ of XPath that involves, in addition, the reverse axes parent ($\backslash$) and ancestor ($\backslash\backslash$). On the other hand, PTPQs represent a very broad fragment $XP^{\{[\,],/,//,\backslash,\backslash\backslash,\approx\}}$ of XPath that corresponds to $XP^{\{[\,],/,//,\backslash,\backslash\backslash\}}$ augmented with the is operation ($\approx$) of XPath2 [1]. The is operator is a node identity equality operator. The conversion of an expression in $XP^{\{[\,],/,//,\backslash,\backslash\backslash,\approx\}}$ to an equivalent PTPQ is straightforward. There is no previous inverted lists evaluation algorithm that directly supports such a broad fragment of XPath.

Note that, as the next proposition shows, a PTPQ is equivalent to a set of TPQs.

Proposition 2.2. Given a PTPQ $Q$, there is a set of TPQs $Q_1, \ldots, Q_n$ in $XP^{\{[\,],/,//\}}$ such that for every XML tree $T$, the answer of $Q$ on $T$ is the union of the answers of the $Q_i$s on $T$.

Fig. 6. A PTPQ and its visual representation.

Fig. 7. The two TPQs corresponding to the PTPQ Q of Fig. 6.

As an example, Fig. 7 shows the two TPQs for query Q of Fig. 6b, which together are equivalent to Q. These TPQs are obtained by merging nodes that participate in a node sharing expression and by choosing compatible topological orders for the nodes that belong to the same partial paths. Based on the previous proposition, one can consider evaluating PTPQs using existing algorithms for TPQs. In Section 4.1, we present such an algorithm. However, the number of TPQs that need to be evaluated can grow to be large (in the worst case, it can be exponential in the number of nodes of the PTPQ).
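The source of the blow-up can be seen by enumerating the total orders compatible with the precedence relationships of a single partial path. The Python sketch below does only that: it ignores node merging, child edges, and nodes with equal labels, so it illustrates the counting argument rather than the TPQ generation procedure itself. The function name and the node names of the bibliography example are hypothetical.

```python
def topological_orders(nodes, precedes):
    """Yield every total order of `nodes` consistent with the (X, Y) pairs
    in `precedes`, where (X, Y) means X must appear above Y on the path."""
    preds = {n: set() for n in nodes}
    for x, y in precedes:
        preds[y].add(x)

    def extend(prefix, placed):
        if len(prefix) == len(nodes):
            yield tuple(prefix)
            return
        for n in nodes:
            # n can be placed next once all of its predecessors are placed
            if n not in placed and preds[n] <= placed:
                prefix.append(n)
                placed.add(n)
                yield from extend(prefix, placed)
                prefix.pop()
                placed.remove(n)

    yield from extend([], set())
```

For a partial path with a root R above three mutually unordered nodes Year, Author, and Genre, the sketch yields 3! = 6 orders; in general, k mutually unordered nodes admit k! orders.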

3 QUERY PROCESSING

As the structure of a path can be partially specified, and nodes can be shared between different partial paths in PTPQs, new structural relationships may be inferred from those explicitly specified in the query. Further, unlike tree-pattern queries, PTPQs may be unsatisfiable and they may include redundant nodes. Inferred structural relationships are necessary in detecting unsatisfiable queries and redundant nodes. In this section, we address these issues and we show how a PTPQ can be processed and put in an annotated graph form which is convenient for evaluation.

3.1 Structural Relationship Inference

Consider query Q4 in Fig. 5d. Since C2 is a parent of E1 and an ancestor of A2, we can infer that E1 is an ancestor of A2 as well. Indeed, since E1 is a child of C2, A2 cannot be placed between them on a path. Next, we formalize the inference of structural relationships.

Definition 3.1. A structural relationship $s$ is derived from a query $Q$ iff for every embedding $M$ of $Q$ to any XML tree, $M$ satisfies $s$. The closure of $Q$ is the set that comprises all the structural relationships that can be derived from $Q$. A query is in full form if it is equal to its closure.

In order to characterize the derivation of structural relationships and compute closures of queries, we introduce a set of inference rules. We start with inference rules for structural relationships in a single partial path (that is, precedence relationships). Let $X$, $Y$, and $Z$ denote not necessarily distinct labels. Let $X_i$, $X_j$, $Y_k$, $Y_l$, and $Z_m$ denote distinct query nodes, where $X_i$ and $X_j$ are labeled by $X$, $Y_k$ and $Y_l$ are labeled by $Y$, and $Z_m$ is labeled by $Z$. Recall that $R$ denotes the root node of a query. In an inference rule, the relationships that precede symbol $\vdash$ infer the relationship that follows it. The absence of expressions that precede $\vdash$ denotes an axiom. The inference rules for a partial path are shown in Fig. 8.

PTPQs may contain node sharing expressions which involve nodes from different partial paths. Fig. 9 shows the inference rules that deal with node sharing expressions.

The next theorem states that the inference rules correctly and completely characterize the derivation of structural relationships. Let $Q$ be a query, and $p$ be a structural relationship not in $Q$. A set of inference rules is sound if whenever $p$ can be produced from $Q$ using the inference rules, $p$ can also be derived from $Q$. It is complete if whenever $Q$ has at least one solution and $p$ can be derived from $Q$, $p$ appears in $Q$ or can be produced from $Q$ using the inference rules.

Theorem 3.1. The set of inference rules of Figs. 8 and 9 is sound and complete.

Soundness is straightforward. In order to prove completeness, we use the concept of weighted transition, which is a sequence of edges in a partial path without backward double edges, and we show completeness by reasoning on the existence of weighted transitions whose weights satisfy specific conditions.

Clearly, the number of structural relationships in the closure of a query is, in the worst case, quadratic in the number of its nodes. In practice, only a small percentage of these relationships appears in the closure of the query. Since a query is usually much smaller than the XML tree, the cost of computing its closure is insignificant.
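The flavor of a closure computation can be sketched as a fixpoint over descendant pairs. The rules of Figs. 8 and 9 are not reproduced in this text, so the Python sketch below does not implement them; it assumes only three derivations for a single partial path: a child is also a descendant, descendant relationships are transitive, and the derivation used for Q4 above (from $X/Y$ and $X//Z$, with $Y$ and $Z$ labeled differently, infer $Y//Z$). The function name and input encoding are hypothetical, and the result is in general a subset of the true closure.

```python
def partial_closure(labels, child, desc):
    """Return the descendant pairs derivable in one partial path with three
    assumed rules (not the full rule set of Figs. 8 and 9).

    labels: dict query_node -> label
    child, desc: sets of (X, Y) pairs for X/Y and X//Y
    """
    d = set(desc) | set(child)            # a child is also a descendant
    while True:
        new = set()
        # Transitivity: X//Y and Y//Z give X//Z.
        for x, y in d:
            for y2, z in d:
                if y == y2:
                    new.add((x, z))
        # Q4-style derivation: Y is a child of X and Z lies below X on the
        # same path, so Z coincides with Y or lies below it; differently
        # labeled nodes cannot coincide, hence Y//Z.
        for x, y in child:
            for x2, z in d:
                if x == x2 and z != y and labels[z] != labels[y]:
                    new.add((y, z))
        if new <= d:
            return d
        d |= new
```

On the relationships C2/E1 and C2//A2 of query Q4, the sketch derives E1//A2, as in the example at the start of this section.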

3.2 Query Satisfiability

Detecting an unsatisfiable query saves execution time at asmall overhead. It prevents accessing the data to get anempty answer.

Definition 3.2 (Satisfiable query). A query is called satisfi-able iff it has a nonempty answer on some XML tree.Otherwise, it is called unsatisfiable.

In contrast to tree-pattern queries, PTPQs and even PPQscan be unsatisfiable. Consider, for instance, the query Q5 ofFig. 5e. This query is unsatisfiable, since it cannot beembedded to any XML tree path. Indeed, the image of nodeB1 needs to lie between the images ofR andE1. But, since it islabeled by B, it cannot coincide neither with the image of A1

nor with the image ofC1. The following proposition providesnecessary and sufficient conditions for query satisfiability:

Theorem 3.2. A PTPQ is unsatisfiable iff its full form comprisesa trivial cycle in a partial path p, i.e., two structuralrelationships of the form Xi½p�==Yj½p� and Yj½p�==Xi½p�.

The if part is straightforward. For the only if part, we assume that the full form of the PTPQ q does not comprise a trivial cycle and we show that q is satisfiable by constructing

2248 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 12, DECEMBER 2012

Fig. 8. Inference rules for precedence relationships in a single partial path.

Fig. 9. Inference rules that involve node sharing expressions.

an XML tree that satisfies q. The XML tree is constructed by 1) applying inference rules involving node sharing expressions as soon as possible, 2) merging nodes participating in node sharing expressions, and 3) replacing double edges by single edges and applying partial path inference rules in a top-down way.

Consider query Q5 of Fig. 5e. This query is unsatisfiable. One can see that its full form has many trivial cycles. For instance, using IR5 three times, the relationship B1//R can be inferred, which creates a trivial cycle with R//B1 in the single partial path of the query.

Checking query satisfiability amounts to checking the full form of the query for trivial cycles in its partial paths. In the worst case, the cost of this check is quadratic in the number of query nodes. Given that the size of a query is not expected to be comparable to the size of the XML database, the cost of checking query satisfiability is insignificant.
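The check for trivial cycles can be illustrated with a minimal sketch. It assumes that the full form has already been computed and is available as a collection of descendant relationships tagged with their partial path; the triple layout and the name `has_trivial_cycle` are illustrative choices, not part of the paper.

```python
def has_trivial_cycle(full_form):
    """Return True iff some partial path contains both X//Y and Y//X.

    full_form: iterable of (path, x, y) triples, each read as x[path]//y[path].
    The triples are assumed to come from an already computed full form.
    """
    seen = set()
    for path, x, y in full_form:
        # A trivial cycle is a relationship whose reverse was already seen
        # in the same partial path.
        if (path, y, x) in seen:
            return True
        seen.add((path, x, y))
    return False


# Hypothetical relationships, loosely modeled on the Q5 discussion above.
print(has_trivial_cycle([("p", "R", "B1"), ("p", "B1", "R")]))
print(has_trivial_cycle([("p", "R", "B1"), ("p", "B1", "E1")]))
```

The scan is linear in the number of relationships of the full form, which is itself at most quadratic in the number of query nodes.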

3.3 Annotated Graph Representation for PTPQs

For evaluation purposes, it is convenient to represent PTPQs as node-labeled annotated directed graphs. In order to do so, a PTPQ Q is first put in full form. Let QG denote the annotated graph representation of Q. Every node X in Q corresponds to a node XG in QG, and vice versa. Node XG is labeled by the label of X. Two nodes in Q participating in a node sharing expression correspond to the same node in QG. Otherwise, they correspond to distinct nodes in QG. For every structural relationship X/Y (resp. X//Y) in Q, there is a single (resp. double) edge from XG to YG in QG. In addition, each node in QG is annotated by the set of PPs of the nodes in Q it corresponds to. Note that these annotations allow us to express same-path constraints. That is, all the nodes annotated by the same partial path have to be embedded to nodes in an XML tree that lie on the same path.

Fig. 10 shows the query graph for the PTPQ Q of Fig. 6. The annotations of the nodes are shown next to them between square brackets. For instance, node E is annotated by PP p2, while node C is annotated by PPs p2 and p3. Note also that due to the inference rules of Fig. 9, a node in the annotated graph inherits all the annotating PPs of its descendant nodes. Because of this inheritance property, we can omit in the figures the annotation of internal nodes in annotated query graphs when no ambiguity arises. For example, in the annotated graph of Fig. 10, node A is annotated by the PPs p1, p2, and p3 inherited from its descendant nodes D, E, and F, respectively. Because these are the only PPs that annotate node A, its annotation can be omitted.

For query evaluation purposes, it is also convenient to introduce a canonical form for queries.

Definition 3.3. A descendant relationship Ai//Bj in a query is called transitive if there is a simple directed path from Ai to Bj (other than Ai//Bj). This path can contain child and descendant relationships. A PTPQ Q is in canonical form iff it contains exactly all the structural relationships of its closure except transitive relationships.

Since a satisfiable query does not comprise cycles in its full form, it has a unique canonical form. This canonical form can be represented as an annotated rooted directed acyclic graph (dag). Computing the canonical form from the full form of a query can be done efficiently by removing all transitive edges. In the following, we assume that PTPQs are satisfiable and in canonical form, and we identify PTPQs with their annotated dag representation.
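Removing transitive edges can be sketched as follows. This is only an illustration under stated assumptions: the full form is given as an acyclic set of `(source, target, kind)` edges with `kind` being `"/"` or `"//"`, and each ordered node pair carries at most one edge. The function name and edge encoding are hypothetical.

```python
def remove_transitive_edges(edges):
    """Drop every descendant edge A//B for which another directed path
    from A to B exists. Child edges are never dropped.

    edges: set of (src, dst, kind) with kind in {"/", "//"}; assumed acyclic.
    """
    succ = {}
    for src, dst, _ in edges:
        succ.setdefault(src, set()).add(dst)

    def has_longer_path(a, b):
        # Search from the successors of a other than b, so that the direct
        # edge a -> b is not counted as a path.
        stack = [n for n in succ.get(a, ()) if n != b]
        visited = set(stack)
        while stack:
            n = stack.pop()
            for m in succ.get(n, ()):
                if m == b:
                    return True
                if m not in visited:
                    visited.add(m)
                    stack.append(m)
        return False

    return {(s, d, k) for (s, d, k) in edges
            if not (k == "//" and has_longer_path(s, d))}


full_form = {("A", "B", "//"), ("B", "C", "/"), ("A", "C", "//")}
print(sorted(remove_transitive_edges(full_form)))
```

In the small example, A//C is dropped because the path A//B, B/C already connects A to C.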

4 PTPQ EVALUATION BY DECOMPOSITION

As mentioned earlier, no previous algorithms exist in the inverted lists model for evaluating XML queries as expressive as PTPQs. In this section, we aim at leveraging existing efficient algorithms for more restricted classes of queries. In order to do so, we outline two approaches that evaluate PTPQs by decomposing them into multiple queries of simpler query classes for which efficient evaluation algorithms exist. We experimentally compare these algorithms in Section 7 with a holistic algorithm we design for PTPQs in the next sections.

4.1 Evaluating a PTPQ through an Equivalent Set of TPQs

The first approach is based on Proposition 2.2. Given a PTPQ Q, this approach: 1) generates a set of TPQs which is equivalent to Q, 2) uses a state-of-the-art algorithm [9] to evaluate them, and 3) unions the results to produce the answer of Q. As mentioned in Section 2.3, the number of TPQs that need to be evaluated can be exponential in the number of nodes of the PTPQ. Therefore, the performance of such an algorithm is not expected to be satisfactory on PTPQs that expand into many TPQs.

In order to restrict the number of TPQs that need to be evaluated, we can take advantage of a structural summary for the XML data, for instance, an index graph. Given a partitioning of the nodes of an XML tree T, an index graph for T is a graph G such that: 1) every node in G is associated with a distinct equivalence class of element nodes in T, and 2) there is an edge in G from the node associated with the equivalence class A to the node associated with the equivalence class B, iff there is an edge in T from a node in A to a node in B. The equivalence class of nodes in T associated with each node in G is called the extent of this node. A 1-index [35], [29] considers as equivalent the nodes in T that have the same incoming path from the root of T. A 1-index is a tree. We define the index tree of T to be a 1-index of T without extents. The index tree can be built by a single depth-first traversal of T in time proportional to the size of T.
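The construction of an index tree by a single depth-first traversal can be sketched as follows. This is a minimal illustration, assuming the XML document fits in memory and is parsed with Python's standard `xml.etree.ElementTree`; the nested-dictionary representation and the name `build_index_tree` are choices made for the example only.

```python
import xml.etree.ElementTree as ET


def build_index_tree(root):
    """Return the index tree (a 1-index without extents) as nested dicts
    mapping a label to the index subtree below it."""
    index = {root.tag: {}}
    # An explicit stack gives a depth-first traversal without recursion,
    # which matters for deep XML trees.
    stack = [(root, index[root.tag])]
    while stack:
        elem, index_node = stack.pop()
        for child in elem:
            # Elements reached by the same label path from the root share
            # a single index node.
            sub = index_node.setdefault(child.tag, {})
            stack.append((child, sub))
    return index


doc = ET.fromstring("<r><a><b/><b/></a><a><c/></a></r>")
print(build_index_tree(doc))
```

Each element is visited once, so the running time is proportional to the size of the document, while the result has one node per distinct root-to-node label path.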

1-indexes are usually much smaller than the corresponding XML data. According to the measurements of Arion et al. [5] on XML documents from different repositories, a 1-index is three to five orders of magnitude smaller than the corresponding XML data. Since index trees do not have

WU ET AL.: PROCESSING AND EVALUATING PARTIAL TREE PATTERN QUERIES ON XML DATA 2249

Fig. 10. (a) PTPQ Q of Fig. 6. (b) Annotated dag representation of Q.

extents, their size is insignificant compared to the size of the XML data.

Given a query Q and an index tree I, we can generate a set 𝒯 of TPQs that is equivalent to Q by finding all the embeddings of Q into I. The images of the nodes of Q under an embedding define a TPQ. Two consecutive image nodes on a path of the index tree are linked in the TPQ through a child relationship if they are linked in the index tree through a child relationship. Otherwise, they are linked in the TPQ through a descendant relationship. Any of the algorithms presented in this paper can be used to find the embeddings of a PTPQ to an index tree. However, even a naive approach would be satisfactory given the size of an index tree.
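How the images of an embedding define the edges of a TPQ can be sketched as follows. The sketch assumes that the index tree is available as a child-to-parent map and that the images of the query nodes under one embedding are given as a set of index nodes; both encodings and the name `tpq_from_images` are hypothetical.

```python
def tpq_from_images(parent, images):
    """Derive TPQ edges from the images of one embedding.

    parent: dict mapping each index-tree node to its parent (root -> None).
    images: set of index-tree nodes that are images of query nodes.
    Returns a set of (ancestor, descendant, kind) edges, kind in {"/", "//"}.
    """
    edges = set()
    for node in images:
        anc = parent[node]
        # Walk up to the closest ancestor that is itself an image.
        while anc is not None and anc not in images:
            anc = parent[anc]
        if anc is not None:
            # Consecutive images linked by an index-tree edge become a child
            # relationship; otherwise a descendant relationship is used.
            kind = "/" if parent[node] == anc else "//"
            edges.add((anc, node, kind))
    return edges


parent = {"r": None, "a": "r", "b": "a", "c": "b"}
print(sorted(tpq_from_images(parent, {"r", "a", "c"})))
```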

Fig. 11 shows an index tree and the single embedding of query Q of Fig. 6 to it, which determines a corresponding TPQ. Contrast this TPQ with the two TPQs of Fig. 7 which together are equivalent to Q. The use of the index tree not only filters out one of the two TPQs for Q but also refines the remaining one by replacing some descendant edges by child edges.

The next proposition shows that the answer of a PTPQ can be correctly computed by the TPQs generated using an index tree. Its proof is straightforward.

Proposition 4.1. Let T be an XML tree and I be its index tree. Let also Q be a PTPQ and 𝒯 = {T1, ..., Tn} be the set of TPQs generated for Q on I. Then, the answer of Q on T is the union of the answers of all the Ti's on T.

In practice, the number of the TPQs for a PTPQ Q is expected to be small. However, one can think of cases where it can still be exponential in the number of nodes in Q. Nevertheless, even in this case, any one of the TPQs generated represents a pattern that occurs in T. Therefore, it will return a nonempty answer when evaluated on T. That is, the number of TPQs is bounded by the number of solutions of Q. We call this approach IndexTPQGen.

4.2 Evaluating a PTPQ by Decomposing It into PPQs

The second approach, called PartialPathJoin, is based on decomposing the given PTPQ into a set of partial path queries corresponding to the partial paths of the PTPQ. For instance, for the PTPQ Q of Fig. 6, the PPQs corresponding to the partial paths p1, p2, and p3 of Fig. 6 are produced. These PPQs are linked among them through node sharing expressions. Given a PTPQ Q, PartialPathJoin: 1) uses the state-of-the-art algorithm [47] to evaluate the corresponding partial path queries, and 2) merge-joins the solutions on their common nodes (nodes participating in the node sharing expressions) to produce the answer of the PTPQ.
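The join step of this approach can be illustrated with a generic sort-merge join of two solution lists on their common query nodes. This is a sketch under assumptions, not the implementation used in the paper: each solution is assumed to be a dictionary from query nodes to comparable XML node identifiers (for example, start positions), and the name `merge_join` is hypothetical.

```python
from itertools import groupby


def merge_join(left, right, common):
    """Sort-merge join of two lists of partial-path solutions.

    left, right: lists of dicts mapping query nodes to XML node identifiers.
    common: query nodes shared by both partial paths (the join attributes).
    """
    def key(sol):
        return tuple(sol[n] for n in common)

    lgroups = [(k, list(g)) for k, g in groupby(sorted(left, key=key), key=key)]
    rgroups = [(k, list(g)) for k, g in groupby(sorted(right, key=key), key=key)]

    out, i, j = [], 0, 0
    while i < len(lgroups) and j < len(rgroups):
        lk, lsols = lgroups[i]
        rk, rsols = rgroups[j]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Solutions that agree on all common nodes are combined.
            out.extend({**l, **r} for l in lsols for r in rsols)
            i += 1
            j += 1
    return out


p1 = [{"A": 1, "D": 5}, {"A": 2, "D": 7}]
p2 = [{"A": 1, "E": 6}, {"A": 1, "E": 8}, {"A": 3, "E": 9}]
print(merge_join(p1, p2, ["A"]))
```

Solutions of a partial path that find no join partner are intermediate results that never reach the answer; the example has one on each side.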

5 DATA STRUCTURES FOR PTPQ EVALUATION

We present in this section the data structures and operations we use for PTPQ evaluation in the inverted lists model.

Query functions. Let Q be a query and X be a node in Q. Boolean function isSink(X) returns true iff X is a sink node in Q (i.e., it does not have outgoing edges in Q). Function parents(X) returns the set of parent nodes of X in Q.

Operations on inverted lists. With every query node X in Q, we associate an inverted list TX of the positional representation of the nodes in the XML tree whose label is that of X. The nodes in TX are ordered by their start field (see Section 2.2). To access sequentially the nodes in TX, we maintain a cursor. We use CX to denote the node currently pointed to by the cursor in TX. Operation advance(X) moves the cursor to the next node in TX. Function eos(X) returns true if the cursor has reached the end of TX.

Stacks. With every query node X in Q, we associate a stack SX. An entry e in stack SX corresponds to a node in TX and has the following two fields:

1. A field consisting of the triplet (start, end, level), which is the positional representation of the corresponding node in TX.

2. A field ptrs, which is an array of pointers indexed by parents(X). Given P ∈ parents(X), ptrs[P] points to the highest among the entries in stack SP that correspond to ancestors of e in the XML tree.

Stack operations. We use the following stack operations: push(SX, entry), which pushes entry on the stack SX, and bottom(SX), which returns the bottom entry of stack SX. Boolean function empty(SX) returns true iff SX is empty.

Initially, all stacks are empty, and for every query node X, its cursor points to the first node in TX. At any point during the execution of the algorithm, the entries that stack SX can contain correspond to nodes in TX before the cursor CX. The entries in a stack below an entry e are ancestors of e in the XML tree. Stack entries form partial solutions of the query that can be extended to become solutions as the algorithm goes on.
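The data structures described above can be sketched in a few lines. The sketch assumes the usual region-encoding reading of the (start, end, level) triplet, in which a node is an ancestor of another iff its interval strictly encloses the other's; the class names `Pos`, `InvertedList`, and `StackEntry` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Pos:
    """Positional representation (start, end, level) of an XML node."""
    start: int
    end: int
    level: int

    def is_ancestor_of(self, other):
        # Assumed region-encoding test: self strictly encloses other.
        return self.start < other.start and other.end < self.end


class InvertedList:
    """Nodes with one label, ordered by start, read through a cursor."""

    def __init__(self, nodes):
        self.nodes = sorted(nodes, key=lambda n: n.start)
        self.cursor = 0

    def current(self):   # C_X in the text
        return self.nodes[self.cursor]

    def advance(self):   # advance(X)
        self.cursor += 1

    def eos(self):       # eos(X)
        return self.cursor >= len(self.nodes)


@dataclass
class StackEntry:
    pos: Pos
    # ptrs[P]: index of the highest ancestor entry in the stack of parent P.
    ptrs: dict = field(default_factory=dict)


a_list = InvertedList([Pos(2, 5, 2), Pos(1, 10, 1)])
print(a_list.current(), a_list.current().is_ancestor_of(Pos(2, 5, 2)))
```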

Query matches. Given a query node X in Q, let Qanc(X) denote the subgraph of Q consisting of X and all the ancestors of X in Q, and Qdes(X) denote the subgraph of Q rooted at X. The nodes in Q annotated by pi which are neither descendants nor ancestors of X in Q are called sibling nodes of X in pi. Qsib(X, pi) denotes the subdag of Q which comprises R, X, and all the sibling nodes of X in pi. Qdes(X, pi) denotes the subgraph of Q rooted at X consisting of all the descendant nodes of X annotated by pi.

As an example, consider the PTPQ Q and the XML tree T shown in Figs. 12b and 12a, respectively. Query Q consists of two partial paths p1 and p2, where the nodes in p1 and p2 are {R, A, B, D, C, E} and {R, A, B, D, G, F}, respectively. Given node D in Q, the nodes of Qanc(D) are R, A, B, and D, and the nodes of Qdes(D) are D, E, and F. Node C (respectively, G) is the only sibling node of D in partial path p1 (respectively, p2). The paths D//E and D//F form Qdes(D, p1) and Qdes(D, p2), respectively.


Fig. 11. An embedding of query Q of Fig. 6 to an index tree that defines a corresponding TPQ.

Let x be a node in the inverted list TX of X. Node x is called an ancestor match of X if there is an embedding of the dag Qanc(X) to a path in the XML tree T such that X is mapped to x. The path formed by the images of the nodes of Qanc(X) (excluding X) under such an embedding is called an ancestor path of x w.r.t. Q. Node x is called a descendant match of X if there is an embedding of the dag Qdes(X) to the XML tree T which maps X to x. Node x is called a sibling match of X for pi if there is an embedding of Qsib(X, pi) to a path in the XML tree T such that X is mapped to x. Node x is called a candidate match of X if the following two conditions are satisfied: 1) x is a descendant match of X, and 2) x is a sibling match of X for every partial path annotating X in Q.

Continuing with the previous example, node d1 is an ancestor match of D, since there is an embedding of Qanc(D) to T which maps D to d1 (and nodes R, A, B to r, a1, b1, respectively). Path r/a1/b1 is an ancestor path of d1. Node d1 is also a descendant match of D, since there is an embedding of Qdes(D) to T which maps D to d1 (and nodes E and F to e1 and f1, respectively). Node d1 is a sibling match of D for p1, since there is an embedding which maps D to d1 (and C to c1). Similarly, node d1 is a sibling match of D for p2. Therefore, d1 is a candidate match of D.

Proposition 5.1. Let X be a node in a query Q, and x be a candidate match of X in an XML tree T such that x is an ancestor of the matches of the descendant nodes and sibling nodes of X in Q. Node x is in a solution of Q on T iff there is an ancestor path for x in T such that every node on it is in a solution of Q on T.

In our running example, by Proposition 5.1, node d1 is in a solution of Q, since d1 is a candidate match of D and every node in the ancestor path r/a1/b1 of d1 is in a solution.

6 A HOLISTIC PTPQ EVALUATION ALGORITHM

The flexibility of the PTPQ language in specifying queries and its increased expressive power make the design of an evaluation algorithm challenging. Three outstanding reasons of additional difficulty are: 1) a query is a dag (which in the general case is not merely a tree) augmented with constraints, 2) the same-path constraints should be enforced for all the nodes in a partial path in addition to enforcing structural relationships, and 3) precedence relationships in a partial path do not necessarily determine a total order for the nodes of the partial path. In this section, we present our holistic evaluation algorithm PartialTreeStack, which efficiently resolves these issues. The holistic algorithm presented in this paper improves the algorithm suggested in [48] for PTPQs and addresses an issue that algorithm had with the checking of the same-path constraint.

6.1 Algorithm PartialTreeStack

6.1.1 Overview

Algorithm PartialTreeStack (Listing 1) operates in two phases. In the first phase, it iteratively calls function getNextCursor to select a node from the inverted lists. It stores the selected nodes in stacks and calls procedure outputPPSolutions to compute the solutions of individual partial paths of the query in this same phase. The details of outputPPSolutions can be found in the presentation of algorithm PartialPathStack [42]. In the second phase, procedure mergeAllPPSolutions is called to sort the partial path solutions (in a topological order of query nodes) and then to merge-join all the sorted solutions in order to form the answer of the query. The details are omitted here in the interest of space.

Listing 1. Algorithm PartialTreeStack

 1  while (¬end()) do
 2    X ← getNextCursor()
 3    if (X ≠ null) then
 4      cleanStacks(parents(X) ∪ {X}, CX)
 5      push(SX, (CX, pointers to the top entry of every parent stack of X))
 6      for (every partial path pi annotating X in Q) do
 7        if (every node Y annotated by pi in Q has a non-empty stack SY) then
 8          outputPPSolutions(pi, X)   {Ref. [42], [47]}
 9      advance(X)
10      knownSoln[X] ← false
11  mergeAllPPSolutions()

Function end()
 1  return ∀ node X ∈ Q: isSink(X) ⇒ eos(X)

Procedure cleanStacks(nodes, n)
 1  for (Y ∈ nodes) do
 2    pop from stack SY the entries that are not ancestors of n
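Procedure cleanStacks can be sketched as follows. The sketch assumes that stack entries and the node n are (start, end, level) triplets, that ancestry is tested by interval containment, and that each stack is a Python list with its bottom entry first; the name `clean_stacks` and the dictionary of stacks are illustrative.

```python
def clean_stacks(stacks, nodes, n):
    """Pop, from the stack of every query node in `nodes`, the entries
    that are not ancestors of the XML node n.

    stacks: dict mapping a query node to a list of (start, end, level)
            triplets, bottom entry first.
    n:      (start, end, level) triplet of the node about to be pushed.
    """
    for y in nodes:
        stack = stacks[y]
        # Entries below the top are ancestors of the top, so once the top
        # is an ancestor of n, every remaining entry is one as well.
        while stack and not (stack[-1][0] < n[0] and n[1] < stack[-1][1]):
            stack.pop()


stacks = {"A": [(1, 20, 1), (2, 5, 2)], "B": [(3, 4, 3)]}
clean_stacks(stacks, ["A", "B"], (6, 7, 2))
print(stacks)
```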

Below, we focus only on getNextCursor, which is the core function of PartialTreeStack. For convenience of presentation, we assume that an input PTPQ Q involves only descendant relationships. PartialTreeStack works equally well on PTPQs with child relationships except that it does not guarantee optimality. In this sense, it is similar to algorithm TwigStack [9] on TPQs. We call a stack-based algorithm on inverted lists asymptotically optimal if its worst case time complexity is linear in the size of the input inverted lists and the size of the query answer.

6.1.2 Function getNextCursor

Function getNextCursor is shown in Listing 2. The main actions of getNextCursor are: 1) discarding useless inverted list nodes, 2) discovering nodes in the remaining inverted lists that are in a solution of the input query Q, and 3) returning to the main algorithm a node which currently has the minimal start value among all the nodes that have been discovered to be in a solution of Q. Function getNextCursor processes query cursors in ascending order of their start value (line 1). In this way, algorithm PartialTreeStack ensures that nodes in a solution are


Fig. 12. (a) An XML tree T. (b) Query Q. (c) The answer of Q on T.

stored in stacks in the order of their start value (i.e., according to a preorder traversal of the XML tree). Due to the structural complexity of PTPQs, it is important to keep such an order to avoid erroneously popping out nodes from their stacks before all the solutions they participate in are generated.

Listing 2. Function getNextCursor()

 1  X ← getNextMin(R)
 2  if (knownSoln[X]) then
 3    return X
 4  Y ← X
 5  while (¬satSamePath(Y)) do
 6    if (∃ node Z in Q s.t. knownSoln[Z] is true) then
 7      return such a node whose cursor has the min start value
 8    if (stack SY is empty or bottom(SY) is not an ancestor of CY) then
 9      let Z be the node among the descendant nodes and sibling nodes of X whose cursor has the minimal end value
10      advance(Z)
11      return null
12    else
13      Y ← the next node among the descendant nodes and sibling nodes of X whose cursor has the minimal start value
14  if (hasAncPath(X)) then
15    for (every node Y in Qdes(X)) do
16      knownSoln[Y] ← true
17    return X
18  else
19    advance(X)
20    return null

Function hasAncPath(X)
 1  if ((X = R) or (∀P ∈ parents(X): ¬empty(SP) and bottom(SP) is an ancestor of CX)) then
 2    return true
 3  else
 4    return false

Function getNextCursor first filters candidate matches and then ancestor path matches. Because of Proposition 5.1, the stored nodes are guaranteed to be solution nodes. Function getNextCursor avoids processing nodes in the inverted lists that are guaranteed not to be part of any solution of the query by advancing the corresponding cursors. This happens when a structural constraint of the dag or a same-path constraint is violated by these nodes. We describe below these features of getNextCursor in more detail.

Dealing with the query dag. Since Q is a dag and the nodes in a partial path of Q do not necessarily form a total order, the order with which cursor nodes are discovered to be in a solution of Q is not necessarily in accordance with their preorder appearance in the XML tree. That is, cursor nodes that are found to be in a solution by a previous call of getNextCursor may have a larger start value than those discovered by the current call of getNextCursor. To prevent redundant computations, a Boolean array, called knownSoln, is used. Array knownSoln is indexed by the nodes of Q. Given a node X of Q, if knownSoln[X] is true, getNextCursor has already discovered that the cursor node CX is in a solution of Q. When getNextCursor encounters a query node X for which knownSoln[X] is true, it returns CX without any further processing (lines 2-3). knownSoln[X] is reset to false after cursor CX is advanced (line 10 in the main algorithm).

Function getNextCursor first calls function getNextMin to identify a node X in Q where the structural relationships in the subdag Qdes(X) are satisfied by the current query cursor nodes and cursor CX has the minimal start value among all such nodes. Function getNextMin is similar to function getNext, which is the core function of algorithm TwigStack [9] dealing with structural relationships in a TPQ. Function getNext processes Q bottom-up to find a query node X that satisfies the following: 1) CX has a descendant node on each of the inverted lists corresponding to the child nodes of X in Q, and 2) each of the descendant nodes of X recursively satisfies this property. For this reason, when all edges in Q are descendant relationships, each node pushed onto its stack is guaranteed to participate in a final solution of Q. Function getNextMin extends getNext to handle structural relationships not only in a tree but also in a dag. In addition, it makes sure that the nodes returned are in the preorder of their appearance in the XML tree.

Dealing with the same-path constraint. Function getNextCursor iteratively calls function satSamePath (shown in Listing 3) to identify in a top-down order the first node Y among the descendant nodes and sibling nodes of X whose cursor is known to be in a solution or satisfies the same-path constraint (lines 5-13). The checking of the same-path constraints is iteratively conducted for each partial path pi annotating X in Q.

Listing 3. Function satSamePath(X)

 1  for (every partial path pi annotating X in Q) do
 2    let desNodes denote the set of nodes of Qdes(X, pi)
 3    desMatches ← {CY}, for every Y ∈ desNodes
 4    let sibNodes denote the sibling nodes of X in pi   {Ref. Section 5}
 5    let nodesOutStack denote all the nodes in sibNodes whose stacks do not contain entries that are ancestors of CX
 6    for (every Y ∈ nodesOutStack whose knownSoln[Y] is false) do
 7      LocateExtension(Y)   {Ref. [27]}
 8    matchesInStack ← {bottom(SY)}, for every Y ∈ (sibNodes − nodesOutStack)
 9    matchesOutStack ← {CY}, for every Y ∈ nodesOutStack
10    if (¬onSamePath(desMatches ∪ matchesInStack ∪ matchesOutStack)) then
11      return false
12  return true

Function onSamePath(matches)
 1  minEnd ← min {m.end : m ∈ matches}
 2  maxStart ← max {m.start : m ∈ matches}
 3  return (maxStart ≤ minEnd)

In each iteration, function satSamePath checks whether the cursors of the nodes of Qdes(X, pi) are on the same path with cursor nodes or stack entries of sibling nodes of X in pi


(lines 2-10). For each sibling node Y that is not yet known to be in a solution, satSamePath calls procedure LocateExtension [27] to locate the first nodes in the inverted list of Y and of its descendants (lines 6-7) that satisfy the precedence relationships. Note that procedure LocateExtension deals only with the structural relationships involving its input query node. The reason for calling LocateExtension is to make sure that the cursors of those query nodes satisfy the precedence relationships imposed on the query nodes before checking whether they satisfy the same-path constraints.

Function onSamePath checks whether nodes lie on the same path using Proposition 2.1.
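The interval test of onSamePath can be sketched as follows. The sketch assumes that every match carries the (start, end) positions of a region encoding, in which the intervals of two nodes are either nested or disjoint, and that the comparison in Listing 3 is a non-strict one; the name `on_same_path` is illustrative.

```python
def on_same_path(matches):
    """Return True iff all matches lie on one root-to-leaf path.

    matches: non-empty iterable of (start, end) pairs. Nodes on a common
    path have nested intervals, so the largest start cannot exceed the
    smallest end.
    """
    matches = list(matches)
    min_end = min(end for _, end in matches)
    max_start = max(start for start, _ in matches)
    return max_start <= min_end


print(on_same_path([(1, 20), (2, 10), (3, 4)]))   # nested intervals
print(on_same_path([(1, 20), (2, 5), (6, 9)]))    # (2, 5) and (6, 9) are disjoint
```

The test needs a single pass over the matches, whatever their number.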

Finding the ancestor path. In order to check whether cursor CX has an ancestor path, it suffices to check whether, for every parent P of X, the stack SP contains an entry which is an ancestor of CX in the XML tree (line 14 of getNextCursor). If this is the case, CX is in a solution of Q according to Proposition 5.1. Otherwise, CX is not in a solution and it is discarded (line 19).
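The ancestor-path test can be sketched as follows, mirroring function hasAncPath. The sketch assumes (start, end) pairs with interval containment as the ancestor test and stacks stored as lists with the bottom entry first; the parameter names and the dictionaries passed in are hypothetical.

```python
def has_anc_path(x, root, parents, stacks, cursor):
    """Return True iff the cursor node of x has an ancestor path.

    parents: dict query node -> list of its parent query nodes.
    stacks:  dict query node -> list of (start, end) pairs, bottom first.
    cursor:  dict query node -> (start, end) of its current cursor node.
    """
    if x == root:
        return True
    cx = cursor[x]
    for p in parents[x]:
        stack = stacks[p]
        if not stack:
            return False
        bottom = stack[0]
        # The bottom entry is the outermost one, so if it does not enclose
        # the cursor node, no entry of this stack does.
        if not (bottom[0] < cx[0] and cx[1] < bottom[1]):
            return False
    return True


parents = {"D": ["A", "B"]}
stacks = {"A": [(1, 20)], "B": [(2, 10)]}
print(has_anc_path("D", "R", parents, stacks, {"D": (3, 4)}))
```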

A step-by-step example of the execution of the algorithm is presented in the appendix, available in the online supplemental material.

6.2 Analysis of PartialTreeStack

Correctness. Assuming that all the structural relationships in a PTPQ Q are regarded as descendant, whenever a node X is returned by getNextCursor, it is guaranteed that CX is in a solution of Q. Moreover, every node in a solution will be discovered by getNextCursor and only useless nodes will be discarded. Further, getNextCursor always returns cursors in a solution in their preorder appearance in the XML tree. In the main part of PartialTreeStack, each cursor CX returned by getNextCursor is pushed onto stack SX in the order it is returned. Finally, whenever CX is popped out of its stack, all the solutions involving CX have been produced. Based on these observations, we can show the following theorem.

Theorem 6.1. Given a PTPQ Q and an XML tree T, algorithm PartialTreeStack correctly computes the answer of Q on T.

Complexity. Given a PTPQ Q and an XML tree T, let |Q| denote the size of the query dag, N denote the number of query nodes of Q, P denote the number of partial paths of Q, IN denote the total size of the input inverted lists, and OUT denote the size of the answer of Q on T. In [6], the recursion depth of a node X of Q in T is defined as the maximum number of nodes in a path of T that are images of X under an embedding of the subdag Qanc(X) to T. We define the recursion depth of Q in T, denoted by D, as the maximum of the recursion depths of the query nodes of Q in T.

Theorem 6.2. The space usage of algorithm PartialTreeStack is O(|Q| × D).

The proof follows from the fact that: 1) the number of entries in each stack at any time is bounded by D, and 2) for each stack entry, the size of ptrs is bounded by the in-degree of the corresponding query node (ptrs is indexed by the parents of that node).

As with TwigStack [9] on TPQs, when Q has no child structural relationships, algorithm PartialTreeStack ensures that each solution produced for a partial path is guaranteed to participate in the answer of Q. Therefore, no intermediate solutions are produced. Consequently, the CPU time of PartialTreeStack is independent of the size of solutions of any partial path in a descendant-only PTPQ.

The CPU time of PartialTreeStack consists of two parts: one for processing input inverted lists, and another for producing the query answer. Since each node in an inverted list is accessed only once, the CPU time for processing the input is calculated by bounding the time interval between two consecutive cursor movements. The time interval is dominated by the invocation of function satSamePath for every query node X and is O(N^2 × P). The CPU time on generating partial path solutions and merge-joining them to produce the query answer is O((IN + OUT) × N).

Theorem 6.3. Given a PTPQ Q without child structural relationships and an XML tree T, the CPU time of algorithm PartialTreeStack is O((IN × N × P + OUT) × N).

Clearly, if the size of the query is insignificant compared to the size of data, PartialTreeStack is asymptotically optimal for queries without child structural relationships.
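For reference, the bound of Theorem 6.3 can be obtained by adding the two parts of the CPU time discussed above. The derivation below is a reading of that argument, assuming N ≥ 1 and P ≥ 1 and that the O(N^2 × P) interval is charged once per inverted list node:

```latex
\begin{align*}
\text{CPU time}
  &= \underbrace{IN \cdot O(N^{2} \cdot P)}_{\text{processing the inverted lists}}
   + \underbrace{O\big((IN + OUT) \cdot N\big)}_{\text{producing the answer}} \\
  &= O\big(IN \cdot N^{2} \cdot P + IN \cdot N + OUT \cdot N\big) \\
  &= O\big((IN \cdot N \cdot P + OUT) \cdot N\big),
\end{align*}
```

where the last step uses $IN \cdot N \le IN \cdot N^{2} \cdot P$. When N and P are treated as constants, the bound reduces to O(IN + OUT), which is the linear behavior required by the definition of asymptotic optimality.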

7 EXPERIMENTAL EVALUATION

We ran a comprehensive set of experiments to compare PartialTreeStack with IndexTPQGen and PartialPathJoin. We also compared PartialTreeStack with a state-of-the-art XQuery engine. In this section, we report on this experimental evaluation.

7.1 Comparison of the Algorithms

Setup. We implemented all algorithms in C++, and ran the experiments on a 3 GHz Intel Core 2 Duo machine with 3 GB RAM, running Debian Linux 5.0. For the experiments, we used both a real and a synthetic data set. As a real data set, we used the Treebank XML document (footnote 1). This data set consists of around 2.5 million nodes (82 MB) and its maximum depth is 36. It includes deep recursive structures. The synthetic data set is a set of random XML trees generated by IBM's XML Generator (footnote 2) and by construction includes highly recursive structures. Initially, we used a data set of maximum depth 16 consisting of 1.5 million nodes (54 MB). For each measurement on the synthetic data set, 10 different XML trees were used. Each value displayed in the plots is averaged over these 10 measurements.

On each of the two data sets, we tested the seven PTPQs shown in Fig. 14. Our query set comprises a full spectrum of PTPQs, from a simple TPQ to partial path queries to complex dags comprising both descendant and child precedence relationships. The query labels are appropriately selected for the Treebank data set, so that they can all produce results. Thus, node labels R, A, B, C, D, E, F, and G correspond to FILE, EMPTY, S, VP, SBAR, PP, NP, and PRP, respectively, on Treebank.

Query execution time. We measured the execution time of IndexTPQGen, PartialPathJoin, and PartialTreeStack for


1. www.cis.upenn.edu/~treebank.
2. www.alphaworks.ibm.com/tech/xmlgenerator.

evaluating the queries of Fig. 14 over the two data sets. Figs. 13a and 13b present the evaluation results.

The time of IndexTPQGen consists of the time for finding the TPQs generated by embedding the PTPQ in the index tree and the time for evaluating these TPQs on the data set. IndexTPQGen has the best performance for Q1 on both data sets. This is expected since Q1 is a TPQ. However, on PTPQs which are not TPQs, its performance is unstable. It usually displays the worst performance unless the TPQ-generation process results in a small number of TPQs, in which case it can even outperform the other two algorithms (e.g., queries Q6 and Q7 on Treebank, where only one TPQ is generated for each query). Its worst performance compared to the other two occurs on PTPQs which generate a large number of TPQs on the index tree of the data set (e.g., Q2 on the synthetic data set). PartialPathJoin shows the best performance for queries Q2 and Q3. This again is expected since these queries are PPQs. However, for PTPQs which are not PPQs, it competes with IndexTPQGen for the worst performance. The reason is that PartialPathJoin wastes time finding intermediate solutions. For example, when evaluating Q1 on Treebank, PartialPathJoin shows the worst performance (Fig. 13a), due to the large number of intermediate solutions generated.

PartialTreeStack has the best time performance in almost all cases of "pure" PTPQs (PTPQs which are not TPQs or PPQs). It is only dominated by IndexTPQGen when the latter happens to need a very restricted number of TPQs to evaluate the query. Its performance is stable, and does not degrade on more complex queries and on data with highly recursive structures.

Execution time varying the input size. We compare the execution time of the three algorithms as the size of the input data set increases. Figs. 15a and 15b report on the execution time of the algorithms when increasing the size of the synthetic data set for queries Q4 and Q5, respectively.

PartialTreeStack consistently has the best performance. When the input size goes up, the execution time of the algorithms increases. This confirms the complexity results that show dependency of the execution time on the input and output size. However, the increase in the execution time of PartialPathJoin and, to a lesser extent, of IndexTPQGen is sharper than that of PartialTreeStack. The reason is that PartialPathJoin is affected by the increase in the number of the intermediate solutions, while the performance of IndexTPQGen is affected by the evaluation of the multiple TPQs generated from the input queries.

Execution time varying the input depth. We also compare the execution time of the three algorithms as the maximum depth of the input data set increases. Figs. 16a and 16b report on the execution time of the algorithms when increasing the input depth of the synthetic data set (its size is fixed to 1.5 million nodes) for queries Q4 and Q5, respectively. In all the cases, PartialTreeStack outperforms the other two algorithms. We observe that as the input depth increases, the execution time of PartialTreeStack increases slowly. In contrast, the increase of the execution time of IndexTPQGen is sharper than that of PartialTreeStack. The same holds for PartialPathJoin on Q4 while PartialTreeStack and IndexTPQGen on Q5 display the same increase rate. This behavior of PartialPathJoin is due to the fact that an increase in the input depth of the data set entails an increased number of intermediate solutions resulting from the proliferation of embeddings of the partial path queries to the paths of the data set. IndexTPQGen does not generate intermediate solutions. The negative effect of the increase of the input depth is due to the fact that the number of TPQs that are used for evaluating the PTPQ can increase when the input depth is increased (and the corresponding index tree becomes larger).

7.2 Comparison of PartialTreeStack with an XQuery Engine

We compare our algorithm PartialTreeStack against a state-of-the-art XQuery database system. We chose for this comparison MonetDB/XQuery, which is shown to outperform other XQuery systems on benchmark XML data [8]. MonetDB/XQuery is a relational main-memory XQuery system. The version of MonetDB/XQuery installed was v4.36.5.

For the comparison, we used the PTPQs Q1-Q7 shown in Fig. 14 on the Treebank data set. In order to execute these queries on MonetDB/XQuery, we wrote corresponding XQuery queries whose answers are sets of tuples of XML nodes indexed by the nodes of the query. PartialTreeStack returns tuples ordered on a topological ordering of the query nodes and on the preorder of XML tree nodes. In order to enforce this order, we assigned an attribute ID to each element in the Treebank data set that records the preorder position of that element in the XML tree, and used it to order the solutions of the XQuery queries.

2254 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 12, DECEMBER 2012

Fig. 14. Queries used in the experiments.

[Figure 13 plots: execution time (sec) of IndexTPQGen, PartialPathJoin, and PartialTreeStack for queries Q1-Q7. (a) Execution time (Treebank). (b) Execution time (Synthetic data).]

Fig. 13. Evaluation of PTPQs on the two data sets.

Typical PTPQs can be formulated in XQuery using a WHERE clause comprising one or more “joins” expressed by the “is” operator. The “is” operator forces the operand query nodes to be matched to the same XML tree node. It can be used to enforce the same-path constraint for the nodes of a partial path of the query. Depending on the case (e.g., for queries Q2, Q3, and Q4), a new wildcard node might need to be inserted in the XQuery expression in combination with descendant-or-self or ancestor-or-self axes for the same constraint to be enforced. Fig. 17 shows the XQuery version of Q5 of Fig. 14.
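To make the same-path constraint concrete, the following minimal Python sketch checks whether a set of XML tree nodes lies on a single root-to-leaf path. It is an illustration only, not part of the XQuery formulation or of PartialTreeStack. It assumes that each node is given as a (begin, end) interval of a region encoding in which an ancestor's interval encloses those of its descendants; the name on_same_path is hypothetical. Coinciding nodes are accepted, in the spirit of the descendant-or-self and ancestor-or-self axes.

```python
from typing import Iterable, Tuple

# Assumed node representation: the (begin, end) interval of a region encoding.
Region = Tuple[int, int]


def on_same_path(regions: Iterable[Region]) -> bool:
    """Return True if all given nodes lie on one root-to-leaf path.

    Assumes the intervals come from a region encoding of a tree, so any two
    intervals are nested, identical, or disjoint.
    """
    # Outermost interval first; ties on begin are broken by the larger end.
    ordered = sorted(regions, key=lambda r: (r[0], -r[1]))
    # Containment is transitive, so checking consecutive pairs is enough.
    return all(a[0] <= b[0] and b[1] <= a[1]
               for a, b in zip(ordered, ordered[1:]))


if __name__ == "__main__":
    # (1, 8) encloses (2, 5), which encloses (3, 4): one path.
    print(on_same_path([(3, 4), (1, 8), (2, 5)]))
    # (2, 5) and (6, 7) are disjoint siblings: two different paths.
    print(on_same_path([(2, 5), (6, 7), (1, 8)]))
```

The check sorts the intervals once and compares neighbors, so it runs in O(k log k) time for k nodes.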

Fig. 18 reports on the time results of the execution of the queries using PartialTreeStack and MonetDB/XQuery. In both cases, the answers are serialized but not displayed. Algorithm PartialTreeStack is not designed to exploit cached results, and for fair comparison, we ran MonetDB/XQuery with a cold cache. As we can see, in some cases MonetDB fails to evaluate the query and reports out of memory errors (queries Q2, Q4, and Q7). In contrast, PartialTreeStack did not encounter any problem running any of the queries examined. This is not a surprise given the worst case analytical space complexity result presented by Theorem 6.2. In fact, PartialTreeStack (as all the holistic algorithms in the inverted lists evaluation model) is designed for non-main-memory evaluation of queries on large persistent XML documents.

Algorithm PartialTreeStack was able to evaluate all the queries. With the exception of query Q1 (which is a TPQ), it outperformed MonetDB on all PTPQs. On all the queries MonetDB/XQuery was able to evaluate, PartialTreeStack improved on MonetDB/XQuery by a factor greater than 2.5. Note that additional experiments (not shown here) showed that similar evaluation times were obtained when the queries were run on MonetDB without ordering the results (that is, when the queries were formulated without the ORDER BY clause).

It is worth noting that these results were obtained without employing any optimization techniques for PartialTreeStack. Different indexing (based on B+ trees [14], [9], [26], [27], [31], [36] or structural summaries [30], [7], [5], [36]) and view materialization [49], [12] techniques for TPQs have been presented aiming at filtering out nodes from the inverted lists that are known, in advance, not to contribute to the answer of the query. Extending these techniques for PTPQs can further improve the performance of PartialTreeStack.

WU ET AL.: PROCESSING AND EVALUATING PARTIAL TREE PATTERN QUERIES ON XML DATA 2255

Fig. 17. The XQuery version of query Q5 of Fig. 14.

Fig. 18. Execution time (in seconds) of PartialTreeStack and MonetDB on XQuery queries corresponding to the queries of Fig. 14 on an 82 MB Treebank document.

Fig. 15. Evaluation of queries on synthetic data with increasing size.

Fig. 16. Evaluation of queries on synthetic data with increasing depth.

8 RELATED WORK

In this paper, we assume that queries are evaluated in the inverted lists evaluation model. This evaluation model uses inverted lists built over the input data to avoid: 1) preloading XML documents in memory, and 2) processing large portions of the XML documents that are not relevant to the query evaluation. Because of these advantages, many XML query evaluation and optimization algorithms have been developed for this model.

Inverted-lists-based evaluation techniques. The evaluation algorithms in the inverted lists model broadly fall into two categories: the structural join approach [53], [2], [50], [14], [26], and the holistic twig join approach [9], [33], [13], [27], [25], [10]. Comparison studies on query evaluation techniques [36], [22] show that holistic stack-based algorithms in the inverted lists model are superior to other algorithms and evaluation models (streaming/navigational [39] or sequential/string matching approaches [41]).

The structural join approach first decomposes a TPQ into a set of binary relationships. Then, it evaluates the relationships using binary merge join. This approach might not be efficient because it generates a large number of intermediate solutions (that is, solutions for the binary relationships that do not contribute to the TPQ answer).
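A binary merge join for one ancestor-descendant relationship can be sketched as follows. This is a simplified stack-based join in the spirit of the structural joins of [2], not the algorithm as published. It assumes that each node is a (begin, end) region-encoding interval and that both input lists are sorted by begin; the function name structural_join is hypothetical.

```python
from typing import List, Tuple

# Assumed node representation: the (begin, end) interval of a region encoding.
Region = Tuple[int, int]


def structural_join(alist: List[Region],
                    dlist: List[Region]) -> List[Tuple[Region, Region]]:
    """Return all (a, d) pairs such that a is a proper ancestor of d.

    Both lists must be sorted by begin. The stack always holds a chain of
    nested ancestor candidates, outermost at the bottom.
    """
    out = []
    stack: List[Region] = []
    i = 0
    for d in dlist:
        # Push every ancestor candidate that starts before d.
        while i < len(alist) and alist[i][0] < d[0]:
            a = alist[i]
            # Drop candidates that end before a starts (they cannot nest a).
            while stack and stack[-1][1] < a[0]:
                stack.pop()
            stack.append(a)
            i += 1
        # Drop candidates that end before d starts.
        while stack and stack[-1][1] < d[0]:
            stack.pop()
        # Every candidate left on the stack encloses d.
        out.extend((a, d) for a in stack)
    return out


if __name__ == "__main__":
    ancestors = [(1, 8), (2, 5)]
    descendants = [(3, 4), (6, 7)]
    for pair in structural_join(ancestors, descendants):
        print(pair)
```

Each input list is scanned once. For a TPQ with several edges, one such join is run per edge, and pairs that match one edge but no complete pattern are the intermediate solutions discussed above.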

The holistic twig join approach (e.g., TwigStack [9]) represents the state of the art for evaluating TPQs. This approach evaluates TPQs by joining multiple input lists at a time to avoid producing large intermediate solutions. Algorithm TwigStack is shown to be optimal for TPQs without child relationships.

Several papers focused on extending TwigStack. For example, in [33], algorithm TwigStackList efficiently evaluates TPQs in the presence of child relationships. Algorithm iTwigJoin extended TwigStack by utilizing structural indexes built on the input lists [13]. Chen et al. [10] proposed algorithms that handle TPQs over graph structured data. Evaluation methods of TPQs with OR predicates were developed in [25]. Algorithm TwigOptimal [17] applies the notion of virtual cursors [51] to enhance the traversal of the XML tree during tree-pattern query evaluation. Algorithm Twig2Stack was presented in [11] to avoid merge-joining path solutions needed by TwigStack.

All the aforementioned query evaluation techniques assume that there are access mechanisms, i.e., indexes, that efficiently return a stream of nodes in the XML tree that satisfy a given node predicate. Nodes within streams are usually represented by their region encoding positional representation [53] (see Section 2.1). Other streaming schemes, e.g., Tag+Level Streaming and Prefix-Path Streaming, are suggested in [13]. In [34], instead of the region encoding positional representation, the authors used an extended Dewey labeling scheme to facilitate query evaluation. In our approach, we assume the region encoding scheme of Zhang et al. [53].
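As a rough illustration of a region encoding, the Python sketch below labels each element of a small document with a (begin, end, level) triple obtained from a depth-first traversal and uses the labels to test structural relationships. It assumes the commonly used (begin, end, level) form; Section 2.1 gives the exact definition adopted in this paper. The helper names region_encode, is_ancestor, and is_parent are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Assumed label layout: (tag, begin, end, level).
Label = Tuple[str, int, int, int]


def region_encode(root: ET.Element) -> List[Label]:
    """Label every element with (tag, begin, end, level) in document order."""
    out = []
    counter = 0

    def visit(node: ET.Element, level: int) -> None:
        nonlocal counter
        counter += 1                      # position of the opening tag
        entry = [node.tag, counter, None, level]
        out.append(entry)
        for child in node:
            visit(child, level + 1)
        counter += 1                      # position of the closing tag
        entry[2] = counter

    visit(root, 1)
    return [tuple(e) for e in out]


def is_ancestor(a: Label, d: Label) -> bool:
    """a is a proper ancestor of d iff a's interval strictly encloses d's."""
    return a[1] < d[1] and d[2] < a[2]


def is_parent(a: Label, d: Label) -> bool:
    """a is the parent of d iff it is an ancestor exactly one level above."""
    return is_ancestor(a, d) and a[3] + 1 == d[3]


if __name__ == "__main__":
    doc = ET.fromstring("<a><b><c/></b><d/></a>")
    for label in region_encode(doc):
        print(label)
```

With such labels, ancestor-descendant and parent-child checks reduce to integer comparisons, so the join algorithms never need to navigate the XML tree itself.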

The above algorithms are developed for TPQs and cannot be used or extended to evaluate PTPQs. The reason is that PTPQs are not mere tree patterns but dags annotated with the same-path constraints.

Optimization techniques for the inverted lists approach. Approaches that speed up the processing of the original holistic evaluation algorithm TwigStack [9] by skipping unnecessary nodes build indexes on the input inverted lists to define node clusterings and/or orderings. They can be classified into the following two categories.

The first category comprises approaches built upon the conventional B+-tree technique. It includes the B+-tree [14], the XB-tree [9], and the XR-tree [26], [27]. A study in [31] compares the performance of the three B+-tree-based techniques on evaluating binary query patterns. This performance comparison is extended in [36] to TPQs.

The second category consists of solutions which combine structural indexes with inverted lists to support XML query evaluation [30], [7], [5], [36]. By partitioning the input XML data nodes according to their structural properties, the size of the resulting structural index is usually smaller than the original XML data. Consequently, the query evaluation conducted directly on the structural index is expected to be more efficient than on the input data itself.
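The effect of such a partitioning can be illustrated with one simple kind of structural summary, a partition of the elements by their root-to-node label path. The Python sketch below is an assumption-laden example only: it is not the index of [30], [7], [5], or [36], nor necessarily the summary exploited by IndexTPQGen, and the name path_summary is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict, List, Tuple


def path_summary(root: ET.Element) -> Dict[Tuple[str, ...], List[ET.Element]]:
    """Group elements by their root-to-node label path.

    Each distinct label path becomes one summary node, and the list stored
    under it is the partition of data nodes sharing that path.
    """
    groups = defaultdict(list)
    stack = [(root, (root.tag,))]
    while stack:
        node, path = stack.pop()
        groups[path].append(node)
        # Push children in reverse so they are visited in document order.
        for child in reversed(list(node)):
            stack.append((child, path + (child.tag,)))
    return dict(groups)


if __name__ == "__main__":
    doc = ET.fromstring("<a><b><c/></b><b><c/><c/></b></a>")
    for path, nodes in path_summary(doc).items():
        print("/".join(path), len(nodes))
```

In this small document, six elements collapse into three summary nodes, which is the size reduction that makes evaluating a query on the summary first attractive.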

Answering XML queries using materialized views in the inverted list model is studied in [49], [12], [40].

All these techniques are orthogonal to our approach. Our algorithms can be extended to take advantage of them. Such an endeavor, though, is beyond the scope of this work.

Streaming evaluation. Considerable work has also been done on the processing of XPath queries when the XML data are not encoded and indexed (main-memory evaluation or streaming evaluation). For example, Gottlob et al. [19] suggested polynomial main-memory algorithms for answering full XPath queries. Streaming evaluation algorithms for different fragments of XPath representing mainly TPQs were presented, among others, in [38], [21], [23]. The streaming evaluation, though the only option for a number of applications, cannot be compared in terms of performance to the inverted lists evaluation we adopt here. The reason is that in the streaming evaluation, no indexes or inverted lists can be exploited and the whole XML document has to be sequentially scanned.

Beyond TPQs evaluation. Several papers have addressed the need to query XML data when the structure is not fully known to the user, or to query (in an integrated way) XML data sources with different structures [28], [15], [24], [32], [43]. Query languages without any structure (keyword-based languages) have been studied in [16], [15], [24]. Li et al. [32] extend XQuery to enable users to query XML documents without full knowledge of the structure. They employ the notion of Meaningful Lowest Common Ancestor Structure to define semantics for keyword queries possibly with structural restrictions and to compute their answers. Flexible queries [28] relate to PTPQs, but their evaluation is not always polynomial. Relaxing a query in order to retrieve more results has been proposed in [3]. However, the returned results are approximate with respect to the initial query.

A number of papers [20], [37] deal with fragments of XPath which are more expressive than TPQs and include, for instance, reverse axes. They show that queries in these fragments can be rewritten equivalently as disjunctions of tree-pattern queries or unions of acyclic conjunctive queries. Nevertheless, the size of the resulting queries can be exponential in the size of the initial query. The focus of these papers is not on the evaluation of queries and they do not provide any evaluation algorithms.


In [46], algorithms for processing dag queries on graph structured XML documents are presented. The algorithms use a special region encoding scheme for representing the XML data graphs. The presented algorithms extend structural join techniques [2]. In order to process a dag query, the algorithms first decompose the dag into a set of binary twig queries. The presented techniques are not able to handle PTPQs, which are dags annotated with the same-path constraints. Further, in contrast to [46], our approach is holistic. The relational approach for computing XML queries enables the use of existing facilities that have been developed for RDBMSs for the processing of XPath queries [18].

PTPQs were initially introduced in [43]. Their containment problem was studied in [44] and PTPQ semantic issues were addressed in [45]. Relevant to our work are also the evaluation algorithms for partial path queries [42], [47]. Partial path queries are not a subclass of TPQs but they form a subclass of PTPQs. Preliminary results on evaluating PTPQs on the inverted list model were presented in [48].

9 CONCLUSION

This paper is motivated by the gap in the efficient processing and evaluation of broad fragments of XPath that go beyond TPQs on large XML repositories. We considered PTPQs, a large fragment of XPath that strictly contains TPQs. Because of their expressive power and flexibility, PTPQs are useful for querying XML documents whose structure is complex or not fully known to the user, and for integrating XML data sources with different structures.

In order to process PTPQs, we introduced a set of sound and complete inference rules to characterize structural relationship derivation. We provided necessary and sufficient conditions for detecting query unsatisfiability and node redundancy. We also showed that PTPQs can be represented as directed acyclic graphs annotated with the same-path constraints.

In order to evaluate PTPQs, we considered the inverted lists evaluation model, which is widely adopted for evaluating XML queries on large XML repositories. In this context, we designed PartialTreeStack, an efficient stack-based holistic algorithm for PTPQs. To the best of our knowledge, no previous algorithms exist in the inverted list model that can efficiently evaluate such a broad fragment of XPath. Under the reasonable assumption that the size of queries is not significant compared to the size of data, PartialTreeStack is asymptotically optimal for PTPQs without child structural relationships. Our experimental results show that PartialTreeStack can be used in practice on a wide range of queries and on large data sets with deep recursion. They also show that PartialTreeStack largely outperforms other decomposition-based approaches that exploit existing techniques for more restricted classes of queries. Further, PartialTreeStack outperforms a state-of-the-art XQuery engine on PTPQs.

Indexing techniques were shown to substantially speed up holistic stack-based algorithms on TPQs. An interesting research direction involves extending the algorithms presented in this paper for PTPQs so that they can take advantage of these optimization techniques. Using materialized views to optimize our PTPQ evaluation algorithm is another useful extension of the present work.

ACKNOWLEDGMENTS

The authors would like to thank Aggeliki Dimitriou for multiple insightful suggestions and comments and for helping us with the experiments in the latest version of this paper.

REFERENCES

[1] World Wide Web Consortium site, W3C. http://www.w3.org/, 2012.

[2] S. Al-Khalifa, H.V. Jagadish, J.M. Patel, Y. Wu, N. Koudas, and D. Srivastava, “Structural Joins: A Primitive for Efficient XML Query Pattern Matching,” Proc. Int’l Conf. Data Eng. (ICDE), 2002.

[3] S. Amer-Yahia, S. Cho, and D. Srivastava, “Tree Pattern Relaxation,” Proc. Int’l Conf. Extending Database Technology: Advances in Database Technology (EDBT), 2002.

[4] S. Amer-Yahia, L.V.S. Lakshmanan, and S. Pandit, “Flexpath: Flexible Structure and Full-Text Querying for XML,” Proc. ACM SIGMOD Int’l Conf. Management of Data (SIGMOD), 2004.

[5] A. Arion, A. Bonifati, I. Manolescu, and A. Pugliese, “Path Summaries and Path Partitioning in Modern XML Databases,” World Wide Web, vol. 11, pp. 117-151, 2008.

[6] Z. Bar-Yossef, M. Fontoura, and V. Josifovski, “On the Memory Requirements of XPath Evaluation over XML Streams,” Proc. ACM SIGMOD-SIGACT-SIGART Symp. Principles of Database Systems (PODS), pp. 177-188, 2004.

[7] A. Barta, M.P. Consens, and A.O. Mendelzon, “Benefits of Path Summaries in an XML Query Optimizer Supporting Multiple Access Methods,” Proc. Int’l Conf. Very Large Data Bases (VLDB), pp. 133-144, 2005.

[8] P.A. Boncz, T. Grust, M. van Keulen, S. Manegold, J. Rittinger, and J. Teubner, “Monetdb/xquery: A Fast Xquery Processor Powered by a Relational Engine,” Proc. ACM SIGMOD Int’l Conf. Management of Data, pp. 479-490, 2006.

[9] N. Bruno, N. Koudas, and D. Srivastava, “Holistic Twig Joins: Optimal XML Pattern Matching,” Proc. ACM SIGMOD Int’l Conf. Management of Data (SIGMOD), 2002.

[10] L. Chen, A. Gupta, and M.E. Kurul, “Stack-Based Algorithms for Pattern Matching on DAGs,” Proc. Int’l Conf. Very Large Data Bases (VLDB), 2005.

[11] S. Chen, H.-G. Li, J. Tatemura, W.-P. Hsiung, D. Agrawal, and K.S. Candan, “Twig2Stack: Bottom-up Processing of Generalized-Tree-Pattern Queries over XML Documents,” Proc. Int’l Conf. Very Large Data Bases (VLDB), 2006.

[12] T. Chen and C.-Y. Chan, “Viewjoin: Efficient View-Based Evaluation of Tree Pattern Queries,” Proc. Int’l Conf. Data Eng. (ICDE), 2010.

[13] T. Chen, J. Lu, and T.W. Ling, “On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques,” Proc. ACM SIGMOD Int’l Conf. Management of Data (SIGMOD), 2005.

[14] S.-Y. Chien, Z. Vagena, D. Zhang, V.J. Tsotras, and C. Zaniolo, “Efficient Structural Joins on Indexed XML Documents,” Proc. Int’l Conf. Very Large Data Bases (VLDB), 2002.

[15] S. Cohen, J. Mamou, Y. Kanza, and Y. Sagiv, “XSEarch: A Semantic Search Engine for XML,” Proc. Int’l Conf. Very Large Data Bases (VLDB), 2003.

[16] D. Florescu, D. Kossmann, and I. Manolescu, “Integrating Keyword Search into XML Query Processing,” Computer Networks, vol. 33, pp. 119-135, 2000.

[17] M. Fontoura, V. Josifovski, E. Shekita, and B. Yang, “Optimizing Cursor Movement in Holistic Twig Joins,” Proc. ACM Int’l Conf. Information and Knowledge Management (CIKM), 2005.

[18] H. Georgiadis and V. Vassalos, “Xpath on Steroids: Exploiting Relational Engines for Xpath Performance,” Proc. ACM SIGMOD Int’l Conf. Management of Data (SIGMOD), pp. 317-328, 2007.

[19] G. Gottlob, C. Koch, and R. Pichler, “Efficient Algorithms for Processing Xpath Queries,” ACM Trans. Database Systems, vol. 30, no. 2, pp. 444-491, 2005.

[20] G. Gottlob, C. Koch, and K.U. Schulz, “Conjunctive Queries over Trees,” Proc. ACM SIGMOD-SIGACT-SIGART Symp. Principles of Database Systems (PODS), pp. 189-200, 2004.

[21] G. Gou and R. Chirkova, “Efficient Algorithms for Evaluating Xpath over Streams,” Proc. ACM SIGMOD Int’l Conf. Management of Data (SIGMOD), pp. 269-280, 2007.


[22] G. Gou and R. Chirkova, “Efficiently Querying Large XML Data Repositories: A Survey,” IEEE Trans. Knowledge Data Eng., vol. 19, no. 10, pp. 1381-1403, Oct. 2007.

[23] W.-S. Han, H. Jiang, H. Ho, and Q. Li, “Streamtx: Extracting Tuples from Streaming XML Data,” VLDB Endowment, vol. 1, no. 1, pp. 289-300, 2008.

[24] V. Hristidis, Y. Papakonstantinou, and A. Balmin, “Keyword Proximity Search on XML Graphs,” Proc. Int’l Conf. Data Eng. (ICDE), pp. 367-378, 2003.

[25] H. Jiang, H. Lu, and W. Wang, “Efficient Processing of XML Twig Queries with Or-Predicates,” Proc. ACM SIGMOD Int’l Conf. Management of Data (SIGMOD), 2004.

[26] H. Jiang, H. Lu, W. Wang, and B.C. Ooi, “XR-Tree: Indexing XML Data for Efficient Structural Joins,” Proc. Int’l Conf. Data Eng. (ICDE), 2003.

[27] H. Jiang, W. Wang, H. Lu, and J.X. Yu, “Holistic Twig Joins on Indexed XML Documents,” Proc. Int’l Conf. Very Large Data Bases (VLDB), 2003.

[28] Y. Kanza and Y. Sagiv, “Flexible Queries over Semistructured Data,” Proc. ACM SIGMOD-SIGACT-SIGART Symp. Principles of Database Systems (PODS), 2001.

[29] R. Kaushik, P. Bohannon, J.F. Naughton, and H.F. Korth, “Covering Indexes for Branching Path Queries,” Proc. ACM SIGMOD Int’l Conf. Management of Data (SIGMOD), pp. 133-144, 2002.

[30] R. Kaushik, R. Krishnamurthy, J.F. Naughton, and R. Ramakrishnan, “On the Integration of Structure Indexes and Inverted Lists,” Proc. ACM SIGMOD Int’l Conf. Management of Data (SIGMOD), pp. 779-790, 2004.

[31] H. Li, M.-L. Lee, W. Hsu, and C. Chen, “An Evaluation of Xml Indexes for Structural Join,” SIGMOD Record, vol. 33, no. 3, pp. 28-33, 2004.

[32] Y. Li, C. Yu, and H.V. Jagadish, “Schema-Free XQuery,” Proc. Int’l Conf. Very Large Data Bases (VLDB), pp. 72-83, 2004.

[33] J. Lu, T. Chen, and T.W. Ling, “Efficient Processing of XML Twig Patterns with Parent Child Edges: A Look-Ahead Approach,” Proc. ACM Int’l Conf. Information and Knowledge Management (CIKM), 2004.

[34] J. Lu, T.W. Ling, C.-Y. Chan, and T. Chen, “From Region Encoding to Extended Dewey: On Efficient Processing of XML Twig Pattern Matching,” Proc. Int’l Conf. Very Large Data Bases (VLDB), 2005.

[35] T. Milo and D. Suciu, “Index Structures for Path Expressions,” Proc. Int’l Conf. Database Theory (ICDT), pp. 277-295, 1999.

[36] M.M. Moro, Z. Vagena, and V.J. Tsotras, “Tree-Pattern Queries on a Lightweight XML Processor,” Proc. Int’l Conf. Very Large Data Bases (VLDB), pp. 205-216, 2005.

[37] D. Olteanu, “Forward Node-Selecting Queries over Trees,” ACM Trans. Database Systems, vol. 32, no. 1, article 3, 2007.

[38] D. Olteanu, “Spex: Streamed and Progressive Evaluation of XPath,” IEEE Trans. Knowledge and Data Eng., vol. 19, no. 7, pp. 934-949, July 2007.

[39] F. Peng and S.S. Chawathe, “XPath Queries on Streaming Data,” Proc. ACM SIGMOD Int’l Conf. Management of Data (SIGMOD), pp. 431-442, 2003.

[40] D. Phillips, N. Zhang, I.F. Ilyas, and M.T. Ozsu, “Interjoin: Exploiting Indexes and Materialized Views in XPath Evaluation,” Proc. Int’l Conf. Scientific and Statistical Database Management (SSDBM), pp. 13-22, 2006.

[41] P. Rao and B. Moon, “Prix: Indexing and Querying XML Using Prufer Sequences,” Proc. Int’l Conf. Data Eng. (ICDE), pp. 288-300, 2004.

[42] S. Souldatos, X. Wu, D. Theodoratos, T. Dalamagas, and T.K. Sellis, “Evaluation of Partial Path Queries on XML Data,” Proc. ACM Conf. Information and Knowledge Management (CIKM), pp. 21-30, 2007.

[43] D. Theodoratos, T. Dalamagas, A. Koufopoulos, and N. Gehani, “Semantic Querying of Tree-Structured Data Sources Using Partially Specified Tree Patterns,” Proc. ACM Conf. Information and Knowledge Management (CIKM), 2005.

[44] D. Theodoratos, P. Placek, T. Dalamagas, S. Souldatos, and T.K. Sellis, “Containment of Partially Specified Tree-pattern Queries in the Presence of Dimension Graphs,” VLDB J., vol. 18, no. 1, pp. 233-254, 2009.

[45] D. Theodoratos and X. Wu, “Assigning Semantics to Partial Tree-Pattern Queries,” Data Knowledge Eng., vol. 64, no. 1, pp. 242-265, 2008.

[46] H. Wang, J. Li, J. Luo, and H. Gao, “Hash-Base Subgraph Query Processing Method for Graph-Structured XML Documents,” VLDB Endowment, vol. 1, no. 1, pp. 478-489, 2008.

[47] X. Wu, S. Souldatos, D. Theodoratos, T. Dalamagas, and T.K. Sellis, “Efficient Evaluation of Generalized Path Pattern Queries on XML Data,” Proc. Int’l Conf. World Wide Web (WWW), pp. 835-844, 2008.

[48] X. Wu, D. Theodoratos, S. Souldatos, T. Dalamagas, and T.K. Sellis, “Efficient Evaluation of Generalized Tree-Pattern Queries with Same-Path Constraints,” Proc. Int’l Conf. Scientific and Statistical Database Management (SSDBM), pp. 361-379, 2009.

[49] X. Wu, D. Theodoratos, and W.H. Wang, “Answering XML Queries Using Materialized Views Revisited,” Proc. ACM Conf. Information and Knowledge Management (CIKM), pp. 475-484, 2009.

[50] Y. Wu, J.M. Patel, and H.V. Jagadish, “Structural Join Order Selection for XML Query Optimization,” Proc. Int’l Conf. Data Eng. (ICDE), 2003.

[51] B. Yang, M. Fontoura, E. Shekita, S. Rajagopalan, and K. Beyer, “Virtual Cursors for XML Joins,” Proc. ACM Int’l Conf. Information and Knowledge Management (CIKM), 2004.

[52] C. Yu and H.V. Jagadish, “Querying Complex Structured Databases,” Proc. Int’l Conf. Very Large Data Bases (VLDB), pp. 1010-1021, 2007.

[53] C. Zhang, J. Naughton, D. DeWitt, Q. Luo, and G. Lohman, “On Supporting Containment Queries in Relational Database Management Systems,” Proc. ACM SIGMOD Int’l Conf. Management of Data (SIGMOD), 2001.

Xiaoying Wu received the BS degree from Central South University in China and the MS degree from the National University of Singapore, both in computer science. She received the PhD degree in computer science from the Computer Science Department at the New Jersey Institute of Technology. She is currently doing postdoctoral research at Columbia University. She has published and conducted research on semistructured data and XML, keyword-query search, query semantics, and query processing and optimization.

Stefanos Souldatos received the diploma degree in electrical and computer engineering from the National Technical University of Athens (NTUA), Greece, in 2002, the MBA degree from the ALBA Graduate Business School, Greece, in 2003, and the PhD degree from the NTUA in 2008. He is the recipient of different scholarships and awards. His research interests include XML, XML query processing, XML query optimization, query containment, web search, and personalization.

Dimitri Theodoratos received the diploma degree in electrical and computer engineering from the National Technical University of Athens, Greece, the MSc degree from the Ecole Nationale Superieure de Telecommunications of Paris, France, and the PhD degree from the University of Paris at Orsay, France, both in computer science. He is an associate professor in the Department of Computer Science at the New Jersey Institute of Technology (NJIT). Prior to NJIT, he taught at the National Technical University of Athens, the University of Ioannina, Greece, and the University of Paris at Orsay, and has been an ERCIM postdoctoral fellow at INRIA in Paris and at RAL in Oxfordshire, United Kingdom. His research interests include database theory, data integration, data warehousing, XML, query processing and optimization, and Semantic Web.


Theodore Dalamagas received the diploma degree in electrical engineering from the National Technical University of Athens (NTU), Greece, in 1996, the MSc degree in advanced information systems from Glasgow University, Scotland, in 1997, and the PhD degree from NTU Athens in 2004. He is currently a researcher in the IMIS Institute of “Athena” R.C. From 2005 to 2006, he was a lecturer in the Department of Computer Science and Technology, University of Peloponnese. His current research interests include: intelligent information retrieval, data clustering methods, data semantics, data integration, sequence data management, and tree pattern query processing.

Yannis Vassiliou received the BSc degree in mathematics from the University of Athens, and the MSc and PhD degrees from the University of Toronto. He is the director of the Institute of Communications and Computer Systems (ICCS) and a professor in the Computer Science Division at the National Technical University of Athens, Greece. He was an associate professor in the Department of Information Systems, University of New York, professor in the Department of Computer Science, University of Crete, and director of the Institute of Computer Science (FORTH). His research interests include data warehouses, information systems, business intelligence, ERP systems, E-commerce, business enterprises modeling, multimedia database systems, human-computer interaction, and natural language processing.

Timos Sellis received the diploma degree from the National Technical University of Athens (NTUA), the MSc degree from Harvard University, and the PhD degree from the University of California at Berkeley, where he was a member of the INGRES group. He is the director of the Institute for the Management of Information Systems (IMIS) and a professor at NTUA in Greece. He was an associate professor in the Department of Computer Science at the University of Maryland, College Park. He has received the Presidential Young Investigator award for 1990-1995 and the VLDB 1997 10 Year Paper Award for his work on spatial databases. He was the president of the National Council for Research and Technology of Greece (2001-2003) and a member of the VLDB Endowment (1996-2000). His research interests include data streams, peer-to-peer databases, data warehouses, the integration of web and databases, and spatiotemporal databases. He is a fellow of the IEEE.
