Transcript
Page 1: [IEEE 22nd International Conference on Data Engineering (ICDE'06) - Atlanta, GA, USA (2006.04.3-2006.04.7)] 22nd International Conference on Data Engineering (ICDE'06) - XCluster Synopses

XCluster Synopses for Structured XML Content

Neoklis Polyzotis∗ — Univ. of California, Santa Cruz — [email protected]

Minos Garofalakis — Intel Research Berkeley

[email protected]

Abstract

We tackle the difficult problem of summarizing the path/branching structure and value content of an XML database that comprises both numeric and textual values. We introduce a novel XML-summarization model, termed XCLUSTERs, that enables accurate selectivity estimates for the class of twig queries with numeric-range, substring, and textual IR predicates over the content of XML elements. In a nutshell, an XCLUSTER synopsis represents an effective clustering of XML elements based on both their structural and value-based characteristics. By leveraging techniques for summarizing XML-document structure as well as numeric and textual data distributions, our XCLUSTER model provides the first known unified framework for handling path/branching structure and different types of element values. We detail the XCLUSTER model, and develop a systematic framework for the construction of effective XCLUSTER summaries within a specified storage budget. Experimental results on synthetic and real-life data verify the effectiveness of our XCLUSTER synopses, clearly demonstrating their ability to accurately summarize XML databases with mixed-value content. To the best of our knowledge, ours is the first work to address the summarization problem for structured XML content in its full generality.

1. Introduction

The Extensible Mark-up Language (XML) has rapidly evolved into an emerging standard for large-scale data exchange and integration over the Internet. Being self-describing and hierarchical in nature, XML provides a suitable data model that can adapt to several diverse domains and hence enables applications to query effectively the vast amount of information available on the Web.

Within the realm of XML query processing, XML summarization has emerged as an important component for the effective implementation of high-level declarative queries. In brief, a concise XML summary, or synopsis, captures (in limited space) the key statistical characteristics of the underlying data and essentially represents a highly-compressed,

∗Supported in part by NSF Grant IIS-0447966

approximate version of the XML database. By executing a query over the synopsis, the optimizer can efficiently obtain selectivity estimates for different query fragments and thus derive the cost factors of candidate physical execution plans.

One of the key challenges in this important problem stems from the inherent complexity of XML data. More specifically, the information content of a semi-structured data store is encoded both in the structure of the XML tree and in the values under different elements. Moreover, the content of XML elements is inherently heterogeneous, comprising different types of values, e.g., integers, strings, or free text, that can be queried with different classes of predicates. As an example, an application may query an XML database with bibliographic information using the following path expression1: //paper[year>2000][abstract ftcontains(synopsis, XML)]/title[contains(Tree)], which will select all titles of papers that were published after 2000, if their abstracts mention the terms "synopsis" and "XML" and their title contains the substring "Tree". To enable low-error selectivity estimates for such queries, an XML summary clearly needs to capture the key correlations between and across the underlying path structure and value content, and provide accurate approximations for different types of value distributions. Given that real-life XML data sets contain highly heterogeneous content, it becomes obvious that realizing this important and challenging goal will provide crucial support for the effective optimization of XML queries in practice.

Related Work.2 Summarizing a large XML data set for the purpose of estimating the selectivity of complex queries with value predicates is a substantially different and more difficult problem than that of constructing synopses for flat, relational data (e.g., [20, 24]). Recent research studies have targeted specific variants of the XML summarization problem, namely, structure-only summarization [1, 18, 25], or structure and value summarization only for numeric values [10, 13, 17, 19, 26]. Correlated Suffix Trees (CSTs) [7] and CXHist [14] are recently-proposed techniques that tackle the problem of XML selectivity estimation for substring predicates. CSTs, however, take a straightforward approach, simply treating string values as an extension of the XML structure; on the other hand, CXHist focuses on the simple case of fully-specified linear XPath expressions. It is not at all clear if these techniques can be extended to the more general problem of twig queries with predicates on heterogeneous value content.

1In this example, we use the ftcontains operator from the Full-Text extensions to XPath [2].

2Due to space constraints, a more detailed discussion of related work is deferred to the full version of this paper.

Proceedings of the 22nd International Conference on Data Engineering (ICDE'06) 0-7695-2570-9/06 $20.00 © 2006 IEEE

Contributions. In this paper, we address the challenging and important problem of XML summarization in the presence of heterogeneous value content. We propose a novel class of XML synopses, termed XCLUSTERs, that capture (in limited space) the key characteristics of the path and value distribution of an XML database and enable selectivity estimates for twig queries with complex path expressions and predicates on element content. In sharp contrast to previous work, our proposed XCLUSTER model provides a unified summarization framework that enables a single XML synopsis to effectively support twig queries with predicates on numeric content (range queries), string content (substring queries), and/or textual content (IR-style queries). To the best of our knowledge, ours is the first attempt to explore the key problem of XML summarization in the context of heterogeneous element values. The main contributions of our work can be summarized as follows.

• XCLUSTER Summarization Model. Our proposed XCLUSTER synopses rely on a clean, yet powerful model of generalized structure-value clusters, a unified, clustering-based framework that can effectively capture the key correlations between and across structure and values of different types. To handle value-based approximations, our framework employs well-known techniques for numeric and string values, and introduces the class of end-biased term histograms for summarizing the distribution of unique terms within textual XML content.

• XCLUSTER Construction Algorithm. We introduce a set of compression operations for reducing the size of an XCLUSTER synopsis and develop a systematic metric for quantifying the effect of a compression step on the accuracy of the XML summary. Our proposed metric captures the impact on the structure-value clustering of the synopsis by taking into account the localized structural and value-based characteristics of the compressed area of the summary. Based on this framework, we propose an efficient, bottom-up construction algorithm that builds an effective XCLUSTER synopsis for a specific space budget by applying carefully selected compression steps on an initial detailed summary.

• Experimental Study Verifying the Effectiveness of XCLUSTERs. We validate our approach experimentally with a preliminary study on real-life and synthetic data sets. Our results demonstrate that concise XCLUSTERs constitute an effective summarization technique for XML data with heterogeneous content, enabling accurate selectivity estimates for complex twig queries with different classes of value predicates.

2. Preliminaries

Data Model. Following common practice, we model an XML document as a large, node-labeled tree T(V, E). Each node u ∈ V corresponds to an XML element and is characterized by a label (or, tag) assigned from some alphabet of string literals, that captures the element's semantics. Edges (ei, ej) ∈ E are used to capture the containment of (sub)element ej under ei in the database. (We use label(ei), children(ei) to denote the label and set of child nodes for element node ei ∈ V.) In addition, each element node ei can potentially also contain a value of a certain type (denoted by value(ei)); we assume the existence of a mapping function type from elements to a set of data types, such that type(ei) is the data type of value value(ei). (Elements with no values are mapped to a special null data type.) Our study considers the following set of possible data types for XML-element values:
• NUMERIC: Captures numeric element values; for instance, in a bibliographic database, NUMERIC values would include book prices, publication years, and so on. Following the usual conventions for numeric database attributes, we assume the NUMERIC values range in an integer domain {0 . . . M − 1}.
• STRING: Captures (short) string values – in a bibliographic database, these would include author/publisher names and addresses, book titles, and so on.
• TEXT: Captures free-text element values – in our bibliographic database example, these would include book forewords and summaries, paper abstracts, and so on. Such textual values need to support an IR-style, full-text querying paradigm based on keyword/index-term search [3, 8]. Based on the traditional set-theoretic, Boolean model of IR, TEXT values are essentially Boolean vectors over an underlying dictionary of terms (where the ith entry of the vector is 1 or 0 depending on whether the ith term appears in the free-text data or not).3
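As an illustration, the typed-element data model above can be sketched in a few lines of Python; the class and field names (`Element`, `vtype`) are ours, not the paper's, and the type mapping is one possible rendering of the `type` function:

```python
from dataclasses import dataclass, field

NULL, NUMERIC, STRING, TEXT = "null", "numeric", "string", "text"

@dataclass
class Element:
    label: str                       # tag drawn from the alphabet of string literals
    value: object = None             # optional element value (None -> null type)
    children: list = field(default_factory=list)

    @property
    def vtype(self):
        # Mapping function `type`: elements with no value map to the null type.
        if self.value is None:
            return NULL
        if isinstance(self.value, int):
            return NUMERIC           # e.g., a publication year
        if isinstance(self.value, str):
            return STRING            # e.g., a book title
        return TEXT                  # e.g., a Boolean set of index terms

# A fragment of the bibliographic example: a paper with year/title/abstract.
paper = Element("paper", children=[
    Element("year", 2002),
    Element("title", "Holistic..."),
    Element("abstract", {"XML", "synopsis"}),   # TEXT value as a term set
])
```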

As an example, Figure 1 depicts a sample XML data tree containing bibliographic data. The document consists of author elements, each comprising a name, and several paper and book sub-elements. Each paper and book comprises a (STRING-valued) title, a (NUMERIC-valued) year of publication, as well as (TEXT-valued) keywords,

3In future work, we intend to explore more flexible, Vector-Space IR models [4] of representing and querying TEXT values in XML documents, and their impact on our summarization framework.


[Figures 1 and 2 are not reproducible in this transcript. Figure 1 depicts a bibliographic XML tree rooted at d0, with author nodes a1 and a11; name nodes n6{...} and n12{...}; paper nodes p2 (year y3{2000}, title t4{Counting...}, keywords k5{XML, Summary, ...}) and p7 (year y8{2002}, title t9{Holistic...}, abstract ab10{XML employs a...}); and book node b13 (year y14{2002}, title t15{Database...}, foreword f16{Database systems have...}). Figure 2 depicts a twig query with root q0 and nodes q1, q2, q3 connected by the edge labels .//p[y>2000], ./t[contains(Tree)], and ./abs[ftcontains(synopsis)].]

Figure 1. Example XML document. Figure 2. Example query.

abstract, and/or foreword sub-elements. Note that element nodes in the tree are named with the first letter of the element's tag plus a unique identifier.

We believe that the above set of value types adequately captures the bulk of real-world XML content. Textual information (i.e., STRING and TEXT values), in particular, is an integral part of real-life XML documents – this is clearly demonstrated by numerous recent research as well as standardization efforts that attempt to integrate XML query languages and query processing with substring and term-based search models [3, 8, 2]. Our work, however, is the first to consider the implications of different types of numeric and textual information on the difficult problem of effective XML summarization [10, 17].

Query Model. The focus of this paper is twig queries with value predicates. More specifically, a twig query Q is a node- and edge-labeled tree Q(VQ, EQ) where each node qi ∈ VQ represents a query variable that is bound, during query evaluation, to a set of elements from the input document (we assume that q0 is the root of the query and is always mapped to the root of the document). Figure 2 shows an example twig query over the sample document of Figure 1. An edge (qi, qj) ∈ EQ denotes a structural constraint between the elements of the source and target variable, specified by an XPath expression edge-path(qi, qj). In our work, we focus on XPath expressions that involve the child and descendant axes, wildcards, and optional predicates on path branches and element values. Conceptually, the evaluation of Q generates all possible assignments of elements to query variables, such that both (a) the structural constraints (as specified by the edge labels), and (b) the value constraints (as specified by the value predicates attached to query nodes), are satisfied. This set of possible assignments constitutes the set of binding tuples, and its cardinality is defined as the selectivity s(Q) of the query.

The class of supported value predicates in our twig queries depends on the value types of the queried XML elements, and is defined as follows.

• NUMERIC range predicates of the general form [l, h], that specify a certain range [l, h] for the NUMERIC values of the designated XML elements; for example, find all the book elements with prices between $60 and $80.
• STRING substring predicates of the general form contains(qs), where qs denotes a query string. A substring predicate is satisfied by XML elements with STRING values that contain qs as a substring (i.e., similar to the SQL like predicate); for instance, return all books such that the publisher name contains the (sub)string "ACM".
• TEXT keyword predicates of the form ftcontains(t1, . . . , tk) (where t1, . . . , tk denote terms from the underlying term dictionary) specifying exact term matches; for example, find all paper elements with abstracts containing the terms "XML" and "synopsis". (Our techniques can also handle other Boolean-model predicates, such as set-theoretic notions of document-similarity [4, 5].)

3. XCLUSTER Synopsis Model

A Generic Structural Graph-Synopsis Model. Abstractly, our structural graph-synopsis model for an XML document tree T(V, E) is defined by a partitioning of the element nodes in V (i.e., an equivalence relation R ⊆ V × V) that respects element labels; in other words, if (ei, ej) ∈ R then label(ei) = label(ej). The graph synopsis defined for T by such an equivalence relation R, denoted by SR(T), can be represented as a directed graph, where: (1) each node v in SR(T) corresponds to an equivalence class of R, i.e., a subset of (identically-labeled) data elements in T (termed the extent of v and denoted by extent(v)); and, (2) an edge (u, v) exists in SR(T) if and only if some element node in extent(u) has a child element in extent(v). (We use label(v) to denote the common label of all data elements in extent(v).)

At a high level, several recently-proposed techniques for building statistical summaries for XML databases (including XSKETCHes [17, 19] and TREESKETCHes [18]) are all roughly based on the abstract "node-partitioning" idea described above. Unfortunately, none of these earlier research efforts considered the impact of element values of possibly different types (most importantly, strings and unstructured text), and their corresponding querying models on the underlying XML-summarization problem. Obviously, given the prevalence of textual data and the importance of supporting full-text search in modern XML data repositories [3, 8], this restriction severely limits the applicability of such earlier solutions in real-life problem scenarios.

The XCLUSTER Synopsis Model. Our proposed XCLUSTER synopses overcome the aforementioned limitations of earlier work by explicitly accounting for different element-value types and querying requirements during the XML-document summarization process. The key idea in XCLUSTERs is to form structure-value clusters that, essentially, group together (i.e., summarize) XML elements that are "similar" in terms of both their XML-subtree structure and their element values. More formally, given an XML document tree T(V, E) with element values, we say that a node-partitioning graph synopsis of T is type-respecting if the underlying equivalence relation R ⊆ V × V respects both element labels and value types; in other words, if (ei, ej) ∈ R then label(ei) = label(ej) and type(ei) = type(ej). (As previously, given a synopsis node v, label(v) and type(v) denote the common label and value type, respectively, of all data elements in extent(v).) Our general XCLUSTER synopsis model is defined as follows.

Definition 3.1 An XCLUSTER synopsis S for an XML document T is a (node- and edge-labeled) type-respecting graph-synopsis for T, where each node u ∈ S stores: (1) an element count count(u) = |extent(u)| (also denoted as |u|); (2) per-edge (average) child counters count(u, v), for each edge (u, v) ∈ S, that record the average number of children in extent(v) for each element in extent(u); and, (3) a value summary vsumm(u) that compactly summarizes the distribution of the collection of type(u) values under all elements in extent(u).

Figure 3 shows a possible XCLUSTER synopsis for our example document, where each cluster essentially represents elements of the same tag. Intuitively, each XCLUSTER node u corresponds to a structure-value cluster of count(u) elements in the original document tree T. Similar to the TREESKETCH synopses, the tuple of edge counters (count(u, v1), count(u, v2), . . .) to the children of u gives the structural representative (or, centroid) of the u-cluster; moreover, vsumm(u) describes the distribution of element values (or, in a probabilistic sense, the value of an "average element") in the u-cluster4. The key idea is that

4Note that, unlike the multi-dimensional histograms of [17], our XCLUSTER value summaries do not explicitly try to capture correlations between sets of values across different synopsis nodes. Uncovering such

all elements in the extent of node u are essentially approximated by count(u) occurrences of the u-cluster centroid, and the key to effective summarization is hence to try to keep these clusters as "tight" as possible. This generalized structure-value clustering model is a key contribution of our work, providing a clean, unified framework for the effective summarization of both path structure and values of different types in XML databases.
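A synopsis node per Definition 3.1 can be sketched as a small record; the class and field names are ours, and the sample numbers below follow the paper's example synopsis, where a cluster of 2 paper elements has on average 1 year, 1 title, and 0.5 abstract children:

```python
from dataclasses import dataclass, field

@dataclass
class XClusterNode:
    """Minimal sketch of an XCLUSTER synopsis node (Definition 3.1)."""
    label: str
    vtype: str                     # common value type of all extent elements
    count: int                     # |extent(u)|
    edge_counts: dict = field(default_factory=dict)  # child node -> avg children
    vsumm: object = None           # type-specific value summary

# Cluster of 2 paper elements; its centroid averages 1 year (Y), 1 title (T),
# and 0.5 abstract (AB) children per element.
P = XClusterNode("paper", "null", 2, {"Y": 1.0, "T": 1.0, "AB": 0.5})

# Estimated number of abstract elements under all papers: 2 * 0.5 = 1.
est_abstracts = P.count * P.edge_counts["AB"]
```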

[Figure 3 is not reproducible in this transcript. It depicts an XCLUSTER synopsis with nodes D(1), A(2), N(2) with vsumm(N), P(2), B(1), K(1) with vsumm(K), AB(1) with vsumm(AB), Y(3) with vsumm(Y), T(3) with vsumm(T), and F(1) with vsumm(F), annotated with edge counts such as 2, 1, and 0.5.]

Figure 3. Example XCLUSTER for the data of Figure 1.

XCLUSTER Value Summaries. We have thus far been deliberately vague on the specifics of the vsumm value summaries stored within XCLUSTER nodes. Not surprisingly, the specific value-summarization mechanism used differs across different value types; that is, the form of vsumm(u) depends on type(u). In what follows, we briefly discuss the different classes of value summaries implemented within our XCLUSTER framework; of course, it is always possible to extend our framework with support for different value-summarization techniques and/or additional types of element values.
• NUMERIC value summaries capture the frequency distribution of all numeric values in the extent of an XCLUSTER node. Summarizing numeric frequency distributions is a well-studied problem in relational approximate query processing, and several known tools can be employed, including histograms [21], wavelets [16], and random sampling [15]. For concreteness, we focus primarily on histograms as our main NUMERIC value summarization tool, although our ideas can easily be extended to other techniques.
• STRING value summaries capture the distribution of different substrings in the collection of strings corresponding to the (STRING-valued) XML elements in an XCLUSTER node's extent. Following earlier work on the problem of

meaningful correlations across different value sets is itself a challengingproblem, and the benefits of the resulting multi-dimensional summariesare often limited by the “curse of dimensionality”; further, designing ef-fective multi-dimensional summaries for correlations spanning differentvalue types (e.g., text and numeric data) is, to the best of our knowledge,an open problem.


substring selectivity estimation [6, 11], we employ Pruned Suffix Trees (PSTs) as our main summarization mechanism for STRING values and substring queries. Our model is based on a modification of the original PST proposal, where the PST records at least one node for each symbol that appears in the string distribution (this implies that the pruning threshold [11] is redundant). We have found that this modification yields accurate approximations, while avoiding large errors for negative substring queries.
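To make the role of the "at least one node per symbol" modification concrete, here is an illustrative sketch only: a flat substring-count table standing in for an actual PST (all names and parameters are ours). It prunes to the most frequent substrings but always retains every single-symbol entry, so a negative substring query (one using a symbol that never occurs) correctly estimates to zero:

```python
from collections import Counter

def build_string_summary(strings, max_len=4, keep=8):
    """Count, for every substring of up to max_len symbols, how many strings
    contain it; prune to the `keep` most frequent entries, but always keep
    every single-symbol entry (the paper's PST modification, rendered here
    on a flat table rather than a suffix tree)."""
    counts = Counter()
    for s in strings:
        subs = {s[i:j] for i in range(len(s))
                for j in range(i + 1, min(i + 1 + max_len, len(s) + 1))}
        counts.update(subs)        # each string contributes each substring once
    kept = dict(counts.most_common(keep))
    kept.update({sub: c for sub, c in counts.items() if len(sub) == 1})
    return kept

def estimate_substring(summary, qs):
    """Crude estimate of how many strings contain qs: exact if retained,
    zero otherwise (so negative queries never produce large errors)."""
    return summary.get(qs, 0)

titles = ["Counting...", "Holistic...", "Database..."]
summary = build_string_summary(titles)
```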

• TEXT value summaries essentially summarize the collection of Boolean term vectors corresponding to the underlying set of elements in an XCLUSTER node's extent. Our basic summarization mechanism here (similar to the case of structure clusters) is the centroid of the underlying vector collection; in other words, letting w1, . . . , wk denote the set of Boolean term vectors corresponding to a TEXT XCLUSTER node u (k = count(u)), we define vsumm(u) as the vector w, where w[t] = (1/k) Σ_{i=1}^{k} wi[t] for each term t (thus, w[t] is the fractional frequency of t in the underlying set of texts). In practice, however, the number of terms (i.e., the dimensionality of our vector-centroid summary) can be quite large, thus placing a significant storage burden on TEXT XCLUSTER nodes. Thus, our XCLUSTER synopses also employ a second level of value summarization for TEXT nodes, where the vector centroid w is itself compressed further using a novel technique, termed end-biased term histograms, that we introduce in this paper. Note that conventional histogramming techniques are likely to be ineffective for compressing such term vectors, since they optimize performance for range aggregates, whereas for TEXT values we are primarily interested in term-matching (i.e., "single-point") queries; for instance, by grouping consecutive frequencies in buckets, traditional histograms will almost certainly lose track of zero-valued entries (i.e., non-existent terms). It is interesting to note that recent studies in Information Retrieval have looked into the general problem of compressing the information stored in term vectors. These works, however, focus on dimensionality-reduction techniques for improving the quality of document clustering [12], or techniques for reducing the disk footprint of an IR engine while preserving relevance rankings [9, 23]. It is not clear, therefore, if they can be extended to the problem that we tackle in this paper, namely, selectivity estimation for point queries over term vectors.

Briefly, our end-biased term histograms compress the term-vector centroid w using a combination of (1) the top-few term frequencies in w (which are retained exactly); and, (2) a uniform bucket that comprises a (lossless) run-length compressed encoding of the binary version of w (i.e., where the entry for t is 1 if w[t] > 0 and 0 otherwise), along with an average frequency for all the non-zero terms in the bucket. To estimate the frequency w[t] using an end-biased term histogram of w, we first try to look up t in the explicitly-retained top frequencies; if that fails, we look up t in the run-length compressed uniform bucket, returning the average bucket frequency if the corresponding entry is 1, and 0 otherwise. Thus, by avoiding the grouping of values into bucket ranges and retaining a lossless representation of the 0/1 uniform bucket, our end-biased term histograms circumvent the problems of conventional histograms, giving an effective summarization mechanism for term vectors.
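The build and lookup steps just described can be sketched as follows; this is our rendering of the paper's description (function and parameter names are ours), with the centroid given as a frequency vector aligned with a fixed term dictionary:

```python
def build_ebh(dictionary, w, top=2):
    """End-biased term histogram sketch: keep the `top` largest frequencies
    of centroid w exactly; every other term goes into one uniform bucket
    holding (a) a lossless run-length encoding of its 0/1 presence bits and
    (b) the average frequency of the bucket's non-zero terms."""
    order = sorted(range(len(w)), key=lambda i: -w[i])
    exact = {dictionary[i]: w[i] for i in order[:top]}
    rest = [i for i in range(len(w)) if dictionary[i] not in exact]
    runs = []                      # run-length pairs [bit, run length]
    for i in rest:
        bit = 1 if w[i] > 0 else 0
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1
        else:
            runs.append([bit, 1])
    nonzero = [w[i] for i in rest if w[i] > 0]
    avg = sum(nonzero) / len(nonzero) if nonzero else 0.0
    return exact, [dictionary[i] for i in rest], runs, avg

def estimate_term(hist, t):
    """Estimate w[t]: exact top frequency if retained; otherwise the bucket
    average if t's run-length bit is 1, and 0 otherwise."""
    exact, rest_terms, runs, avg = hist
    if t in exact:
        return exact[t]
    if t not in rest_terms:
        return 0.0
    pos = rest_terms.index(t)
    for bit, length in runs:       # walk the runs to recover t's presence bit
        if pos < length:
            return avg if bit else 0.0
        pos -= length
    return 0.0

dictionary = ["XML", "database", "histogram", "synopsis", "tree"]
w = [0.9, 0.0, 0.3, 0.5, 0.3]      # e.g., 90% of the texts mention "XML"
hist = build_ebh(dictionary, w)
```

Note how the zero entry for "database" survives losslessly, which is exactly the property that conventional range-bucket histograms would destroy.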

4. Building XCLUSTER Summaries

In this section, we tackle the challenging problem of effective synopsis construction, and introduce novel, efficient algorithms for building an accurate XCLUSTER summary for a given space-budget constraint.

At a high level, our goal during XCLUSTER construction is to group XML elements in a manner that results in "tight" structure-value clusters, i.e., XCLUSTER nodes that represent elements with high similarity in both their subtree structures and their value distributions in the original XML document. This natural analogy with clustering, however, highlights the key challenges of our XCLUSTER construction problem. First, clustering problems are generally known to be computationally intractable, even in the simple case of a fixed set of points in low-dimensional spaces [22, 27]. Second, our XCLUSTER nodes capture both structure and value distributions – this means that our clustering error metric has to appropriately combine both the structural- and value-summarization errors of a cluster, in a manner that accurately quantifies the effects of such errors on the final approximate query estimates.

Our proposed XCLUSTER construction algorithm is based on a generic bottom-up clustering paradigm: starting from a large, detailed reference XCLUSTER summary of the underlying XML document, our construction process builds a synopsis of the desired size by incrementally merging XCLUSTER nodes with similar structure and value distributions until the specified space constraint is met; in addition, our construction algorithm can also apply value-summary compression operations that further compress the value-distribution information stored locally at synopsis nodes (e.g., by merging similar adjacent buckets in a NUMERIC-values histogram). In what follows, we first describe these two key operations and their impact on the quality of clustering, and then introduce our proposed construction algorithm.
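The bottom-up paradigm can be sketched as a generic greedy loop; only node merges are shown (value-summary compression steps would be additional candidate moves), and all five callables are placeholders for the paper's operators, not its actual implementation:

```python
def build_synopsis(reference, budget, size, candidates, merge, delta):
    """Starting from a detailed reference summary, repeatedly apply the
    candidate merge with the smallest marginal error increase delta(S, S')
    until the space budget is met."""
    S = reference
    while size(S) > budget:
        pairs = candidates(S)
        if not pairs:
            break
        u, v = min(pairs, key=lambda p: delta(S, merge(S, p[0], p[1])))
        S = merge(S, u, v)
    return S
```

As a toy instantiation, clusters can be lists of values, merging concatenates them, and the error of a merge is the spread of the merged cluster; starting from singletons {1}, {2}, {8}, {9} with a budget of two nodes, the loop merges {1, 2} and {8, 9}.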

4.1. Merging XCLUSTER Nodes

Consider two nodes u and v with identical labels and value types (i.e., label(u) = label(v) and type(u) = type(v)) in a given XCLUSTER synopsis S, and let parents(u) (children(u)) denote the set of parent (respectively, children) nodes of u in S (similarly for v). (Note


that, since S is a general directed-graph structure, the parent and children sets for nodes u and v may overlap, i.e., u and v can have some common parents/children in S.) As previously, we use vsumm(u) and vsumm(v) to denote the value-distribution summaries for the two XCLUSTER nodes. Our XCLUSTER-construction process can choose to apply a node-merge operation on such (label/type-compatible) node pairs in order to further reduce the space requirements of the S synopsis.

Figure 4. Node-merge operation.

Intuitively, such a node-merge operation creates a smaller synopsis S′ from S (denoted by S′ = merge(S, u, v)), in which the two original nodes u and v are replaced by a new node w representing the structure-value cluster of the combined collection of XML elements in u and v; that is, extent(w) = extent(u) ∪ extent(v). Of course, all edges in the original S synopsis are maintained in S′, i.e., parents(w) = parents(u) ∪ parents(v) and children(w) = children(u) ∪ children(v). (This is shown pictorially in Figure 4.) Furthermore, the structural summary information for the new node w is computed in a natural manner (based on the cluster-centroid semantics described in Section 3), as an appropriately-weighted combination of the summary information in the two merged nodes u and v; more formally, we define count(w) = |w| = |u| + |v|, and, for each child node c and parent node p of w in the new synopsis, we define the edge counts from/to w as

count(w, c) = (|u| count(u, c) + |v| count(v, c)) / |w|   and   count(p, w) = count(p, u) + count(p, v),

where, of course, count(u, c) = 0 (count(p, u) = 0) if c ∉ children(u) (respectively, p ∉ parents(u)) (and, similarly, for v). Recall that our XCLUSTER edge counts count(x, y) correspond to the average number of y-children per element of x – it is easy to see that the above formulas for the edge counts of the merged node w retain these average-count semantics in the resulting synopsis S′.
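The two count formulas can be illustrated directly (a small sketch; names are ours, with edge counts kept in dicts and absent edges contributing 0):

```python
def merged_child_counts(count_u, count_v, nu, nv):
    """Child edge counts of the merged node w: a weighted average over
    |w| = |u| + |v|, with count(x, c) = 0 for edges absent from x."""
    children = set(count_u) | set(count_v)
    nw = nu + nv
    return {c: (nu * count_u.get(c, 0.0) + nv * count_v.get(c, 0.0)) / nw
            for c in children}

def merged_parent_count(count_p_u, count_p_v):
    """Parent-side counts simply add: count(p, w) = count(p, u) + count(p, v)."""
    return count_p_u + count_p_v

# Two clusters: u has 2 elements averaging 1 abstract (AB) child each;
# v has 1 element with no abstracts. Merged: (2*1.0 + 1*0.0) / 3 = 2/3.
w_counts = merged_child_counts({"AB": 1.0, "T": 1.0}, {"T": 1.0}, 2, 1)
```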

The value-distribution summary information for w in S′ is similarly computed by appropriately "fusing" the value-distribution summaries of u and v to produce a summary for the combined collection of element values in extent(u) ∪ extent(v). That is, we define vsumm(w) = f(vsumm(u), vsumm(v)), where the specifics of the value-summary fusion function f() depend on the type of element values in nodes u, v.
• If type(u) = type(v) = NUMERIC, we build the combined histogram vsumm(w) by merging bucket-count information from the individual histograms vsumm(u) and vsumm(v). This involves an initial bucket-alignment step, where vsumm(u) and vsumm(v) acquire buckets on the same set of ranges by (potentially) splitting the ranges and counts of their existing buckets (based on conventional histogram-uniformity assumptions [21]); subsequently, the two histograms are merged to produce vsumm(w) by summing the frequency counts across the aligned buckets.
• If type(u) = type(v) = STRING, we build the combined PST vsumm(w) by starting out with an empty tree and simply inserting all substrings found in vsumm(u) and vsumm(v). The count for a substring s in vsumm(w) is set equal to the sum of the individual counts for s in vsumm(u) and vsumm(v).
• If type(u) = type(v) = TEXT, then, assuming that vsumm(u) and vsumm(v) denote the Boolean term-vector centroids for u and v, we define the combined centroid vector for w as a simple weighted combination of the two individual vectors; that is, vsumm(w) = (|u|/|w|) vsumm(u) + (|v|/|w|) vsumm(v). (Merging of end-biased term histograms can be defined in a similar manner.)
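The TEXT-case fusion is the easiest to make concrete; a minimal sketch (names are ours), with centroids as dicts of fractional term frequencies and a term missing from one centroid contributing frequency 0:

```python
def fuse_text_centroids(wu, wv, nu, nv):
    """vsumm(w) = (|u|/|w|) vsumm(u) + (|v|/|w|) vsumm(v), taken over the
    union of terms of the two centroids."""
    nw = nu + nv
    terms = set(wu) | set(wv)
    return {t: (nu * wu.get(t, 0.0) + nv * wv.get(t, 0.0)) / nw
            for t in terms}

# u: 1 text mentioning "XML"; v: 2 texts, half mentioning "XML", half "tree".
# Fused centroid: "XML" -> (1*1.0 + 2*0.5)/3 = 2/3, "tree" -> (2*0.5)/3 = 1/3.
fused = fuse_text_centroids({"XML": 1.0}, {"XML": 0.5, "tree": 0.5}, 1, 2)
```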

Quantifying Node-Merging Approximation Error. Applying a node-merge operation on an XCLUSTER synopsis S to obtain a smaller synopsis S′ = merge(S, u, v) increases the approximation error in the resulting summary. Intuitively, this comes from fusing two structure-value clusters u, v in S into a single, "coarser" structure-value cluster w in S′. The increase in approximation error when going from S to S′ (denoted by Δ(S, S′)) comprises two key components: (1) structural clustering error due to the fusion of the two structure centroids (i.e., edge-count tuples) for the clusters extent(u) and extent(v) into a single, weighted structure centroid for the combined cluster extent(w) = extent(u) ∪ extent(v) in S′; and, (2) value clustering error due to the merging of the two value summaries vsumm(u) and vsumm(v) (for values in extent(u) and extent(v), respectively) into a single value summary vsumm(w) for the union of the two value collections.

One of the key challenges in our XCLUSTER synopsis-construction framework is to appropriately quantify and combine these two forms of error in a meaningful overall approximation-error difference Δ(S, S′) between the two synopses S and S′. Obviously, the problem is further complicated by the several different types of values and value predicates that our synopses need to support.

The basic idea in our approach is to quantify the increase in both structure and value clustering error in S′ by essentially measuring their impact on the estimation errors for a collection of atomic queries involving the synopsis nodes

Proceedings of the 22nd International Conference on Data Engineering (ICDE’06) 0-7695-2570-9/06 $20.00 © 2006 IEEE


affected by the merge operation. The key observation here is that our merge(S, u, v) operator has a very localized effect on the synopsis, essentially changing the edge-count and value-distribution centroids attached to nodes u and v. Thus, it is possible to capture the average behavior of arbitrary queries by measuring the estimation error on simple, localized query paths (or, atomic queries) in the affected parts of the summary. (Taking an analog from the domain of measurements, our atomic query paths are essentially a set of micro-benchmarks that allows us to quantify different aspects of a system’s performance without a full-blown query benchmark.)

More specifically, to quantify the increase in estimation error, our set of atomic query paths comprises all paths of the form u[p]/c and v[p]/c in S, where c ∈ Cu ∪ Cv (Cx is a shorthand for children(x)) and p is a simple, atomic value predicate on the underlying value-distribution summaries vsumm(u) and vsumm(v). (And, of course, the corresponding set of w[p]/c paths in S′ = merge(S, u, v).) The exact definition of the atomic predicates p in our query paths obviously depends on the nature of the summary: For NUMERIC histograms, the atomic predicates correspond to all possible range predicates of the form [0, h] over the domain of the summary;⁵ for STRING PSTs, atomic predicates correspond to all substrings in the summary; and, for TEXT summaries, atomic predicates refer to all individual terms. Without going into the details of our estimation algorithms (Section 5), it is not difficult to see that (based on the average-count semantics for our synopsis edge counts) the average number of c elements reached per element of u by the u[p]/c path in S is exactly e_S(u, p, c) = σ_p(u) · count(u, c), where the selectivity σ_p(u) of the atomic predicate p at u is estimated from vsumm(u) (and, similarly, for e_S(v, p, c) and e_S′(w, p, c)). Summing the squared atomic-query estimation errors over all possible (p, c) combinations for the merged nodes u, v, we obtain the overall increase in approximation error for the compressed summary S′ = merge(S, u, v) as

Δ(S, S′) = |u| Σ_p Σ_{c ∈ Cu∪Cv} (e_S(u, p, c) − e_S′(w, p, c))² + |v| Σ_p Σ_{c ∈ Cu∪Cv} (e_S(v, p, c) − e_S′(w, p, c))².
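The formula above can be transcribed directly into code. In the sketch below, the node sizes, selectivities, and edge counts are illustrative stand-ins supplied as plain dictionaries, not quantities from the paper:

```python
# Direct transcription of Δ(S, S'): e_S(x, p, c) = sigma[x][p] * count[x][c],
# and the error sums squared differences against the merged node w,
# weighted by the cluster sizes |u| and |v|.

def delta(size, sigma, count, u, v, w, preds, children):
    def e(x, p, c):
        return sigma[x][p] * count[x].get(c, 0.0)
    err = 0.0
    for x in (u, v):
        err += size[x] * sum((e(x, p, c) - e(w, p, c)) ** 2
                             for p in preds for c in children)
    return err

# Toy instance: two clusters of 2 elements each, one predicate, one child.
size = {'u': 2, 'v': 2}
sigma = {'u': {'p': 0.5}, 'v': {'p': 0.1}, 'w': {'p': 0.3}}
count = {'u': {'c': 4}, 'v': {'c': 2}, 'w': {'c': 3}}
d = delta(size, sigma, count, 'u', 'v', 'w', ['p'], ['c'])  # ≈ 3.4
```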

Thus, by effectively quantifying the impact of a merged structure-value cluster on the error of all localized query estimates affected by the merge operation, our clustering-error metric provides us with an intuitive, meaningful measure for unifying structure and value approximation errors and guiding the choice of compression operations during our bottom-up XCLUSTER construction algorithm. It is interesting to note that TREESKETCH construction [18] relies on a similar node-merging approach for building an effective structural summary within a specific space budget. The TREESKETCH algorithm, however, evaluates merge operations based on a global structural-clustering metric that requires accessing a large, detailed count-stable summary [18] during the build process. Our proposed approach, on the other hand, relies on a localized structure-value clustering metric that uses the current synopsis as the point of reference; as a result, it is both efficient to compute and significantly less demanding in terms of memory overhead.

⁵Using atomic predicates over ranges of the histogram is needed in order to avoid introducing “holes” (i.e., zero-count ranges) in the merged histograms.

4.2. Compressing Value Summaries

As described above, fusing value-distribution summaries during structural node merges essentially preserves all the detail present in the original value summaries. In order to effectively compress the value information in XCLUSTER nodes (e.g., to meet a specified space budget), we now introduce appropriate value-compression operations that can be applied to different types of value summaries. For a specific node u ∈ S with value summary vsumm(u), a value-compression operation results in a new synopsis S′ where vsumm(u) is a coarser approximation of the same value distribution. To quantify the error that is introduced by summary compression, we rely on the same metric as in the case of structural merges, i.e., we quantify Δ(S, S′) as the sum-squared estimation error for the set of atomic queries u[p]/c. The key difference here, of course, is that the structure of the summary remains unchanged, so we only need the first summand in the above formula for Δ(S, S′) (with w = u).

We introduce three value-compression operations in our framework that cover the different types of value summaries: (a) hist_cmprs for NUMERIC nodes, (b) tv_cmprs for TEXT nodes, and (c) st_cmprs for STRING nodes.

• hist_cmprs(u, b). Here, u is a node with a histogram value summary vsumm(u) and b is a positive integer. This operation results in a new histogram vsumm(u) that contains b fewer buckets than the original value summary. The new histogram can be constructed from the original distribution, if it is available, or it can be formed by performing b merge operations on adjacent bucket pairs in vsumm(u) (the latter can be implemented without storing the original distribution and is thus more efficient).

• tv_cmprs(u, b). Here, u is a summary node with an end-biased term histogram vsumm(u) and b is a positive integer. Similar to hist_cmprs, this operation essentially reduces the number of singleton “buckets” in vsumm(u), i.e., the number of terms for which the synopsis records exact frequency information, by moving the b lowest-frequency indexed terms to the uniform term bucket and appropriately adjusting the corresponding average frequency (used to approximate all frequencies in the uniform bucket).

• st_cmprs(u, b). Here, u is a node with a PST summary vsumm(u) and b is a positive integer. This operation results in the pruning of b leaf nodes from the vsumm(u) PST, and hence in a coarser approximation of the underlying string distribution. We propose a novel pruning scheme for PST summaries that removes a specified number of nodes while trying to minimize the resulting estimation error. More concretely, we associate with each leaf node x a pruning error that quantifies the difference in estimates, before and after the pruning of x from the PST, for the substring that x represents. Our observation is that this pruning error is a good indicator of the importance of node x in the PST structure, or, equivalently, of how well the Markovian estimation assumption [11] applies at x. Our pruning scheme removes b nodes based on their pruning error (thus attempting to minimize the impact on estimation accuracy), while ensuring that the monotonicity constraint of the PST is always preserved. In the interest of space, the complete details of this scheme can be found in the full version of this paper.
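A plausible implementation of the adjacent-bucket variant of hist_cmprs can be sketched as follows. The pair-selection rule (merge the pair whose fused density drifts least from the originals) is our assumption; the text does not fix one:

```python
# Sketch of hist_cmprs(u, b) via adjacent-bucket merges: repeatedly
# fuse the neighboring bucket pair whose merge perturbs the per-value
# frequency densities the least, b times. (Selection rule is ours.)

def hist_cmprs(hist, b):
    hist = list(hist)
    for _ in range(b):
        def cost(i):
            # Drift between each bucket's density and the fused density.
            (lo1, hi1, c1), (lo2, hi2, c2) = hist[i], hist[i + 1]
            d1, d2 = c1 / (hi1 - lo1), c2 / (hi2 - lo2)
            d = (c1 + c2) / (hi2 - lo1)
            return abs(d - d1) * (hi1 - lo1) + abs(d - d2) * (hi2 - lo2)
        i = min(range(len(hist) - 1), key=cost)
        (lo1, _, c1), (_, hi2, c2) = hist[i], hist[i + 1]
        hist[i:i + 2] = [(lo1, hi2, c1 + c2)]
    return hist
```

For example, on [(0, 1, 10), (1, 2, 10), (2, 3, 100)] with b = 1, the two equal-density buckets are fused first, leaving the high-count bucket intact.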

4.3. XCLUSTERBUILD Algorithm

Having introduced our key component operations for reducing the size of an XCLUSTER summary, we now describe our construction algorithm, termed XCLUSTERBUILD, for building an accurate XCLUSTER synopsis within a specific storage budget.

The pseudo-code for XCLUSTERBUILD is depicted in Figure 5. The algorithm receives as input the XML data tree T and two parameters, namely, Bstr and Bval, that define the structural- and value-storage budgets. In a nutshell, Bstr specifies the storage for recording information on the graph synopsis (nodes + edges + edge-counts), while Bval specifies the space that will be devoted to value summaries. The algorithm constructs an initial reference synopsis (line 1) that represents a very detailed clustering of the XML database, and subsequently reduces its size using the operations that were introduced in the previous sections (lines 2-18). As shown, the last step proceeds in two phases: (1) a structure-value merge phase, where merge operations are used in order to reduce the structure of the synopsis within Bstr space units, and (2) a subsequent value-summary compression phase, where the algorithm applies value-based operations in order to compress the storage of value-distribution summaries within Bval units. In short, the first phase generates a “tight” structure-value clustering of input elements, while the second phase builds value summaries that accurately approximate the content of different element clusters. As we will discuss later, these two properties have a strong connection to the assumptions of the XCLUSTER estimation framework and are effectively the key to the construction of accurate synopses.

Procedure XCLUSTERBUILD(T, Bstr, Bval)
Input: XML tree T; Structural budget Bstr; Value budget Bval
Output: XCLUSTER synopsis S
begin
1.  Initialize S with the reference synopsis
    /** (1) Structure-Value Merge **/
2.  l := 1; Candstr := build_pool(S, Hm, l)
3.  while structural information in S > Bstr do
4.    while |Candstr| > Hl do
5.      Apply merge m ∈ Candstr : Δ(S, S′)/(|S|str − |S′|str) is minimized
6.      Remove m from Candstr and recompute losses
7.    end
8.    l := 1 + (max level of new vertices in this stage)
9.    Candstr := build_pool(S, Hm, l)  // Replenish pool
10. end
    /** (2) Value-Summary Compression **/
11. for each u ∈ S with values do  // Init heap
12.   Candval ← (compression on vsumm(u))
13. end
14. while value information in S > Bval do
15.   Apply value compression m ∈ Candval : Δ(S, S′)/(|S| − |S′|) is minimized
16.   Let vsumm(u) be the summary that m compressed
17.   Candval −= m; Candval += (compression on vsumm(u))
18. end
19. return S
end

Figure 5. Algorithm XCLUSTERBUILD.

The key components of XCLUSTERBUILD, namely, the

computation of the reference synopsis and the two compression stages, are discussed in detail in the paragraphs that follow. Before proceeding with our detailed discussion, we note that it is possible to invoke XCLUSTERBUILD with a unified total space budget B and let the construction process determine automatically the ratio of structural- to value-storage budget. One plausible approach, for instance, would be to perform a binary search in the range of possible Bstr/Bval ratios, based on the observed estimation error on a sample workload. This topic, however, is beyond the scope of our current work and we intend to investigate it further in our future research.

Reference Synopsis Construction. Our reference synopsis is essentially a refinement of the lossless count-stable summary of the input data [18]. More concretely, each cluster groups elements that have the same number of children in any other summary node, and is associated with a detailed content summary that approximates the distribution of element values with low error. Moreover, each cluster has exactly one incoming path in order to capture potential path-to-value correlations. This detailed clustering in our (large) reference summary obviously provides a very accurate approximation of the combined structural and value-based distribution information of the input XML tree. (Due to space constraints, we defer the complete details of our reference-synopsis construction to the full paper.)

Structure-Value Merge. The goal of this phase is to compress the structure of the computed summary within Bstr space units while preserving the key correlations between and across the structural and value-based distributions. To achieve this, our algorithm applies a sequence of node-merge operations that are selected based on a marginal-losses heuristic (line 5): among the candidate operations, the algorithm selects the merge m that yields the least distance Δ(S, S′) (i.e., loss in accuracy) per unit of saved storage, that is, m minimizes Δ(S, S′)/(|S|str − |S′|str), where S′ is the resulting synopsis and |S|str − |S′|str is the space savings for structural information. In order to reduce the number of possible operations, our algorithm employs two key techniques. First, it only considers operations from a pool Candstr of at most Hm candidate merges, where Hm is a parameter of the construction process. The pool is maintained as a priority queue based on marginal losses, thus making it very efficient to select the most effective operation at each step. Once the top operation is applied, the algorithm re-configures the queue by computing the new marginal losses for operations in the neighborhood of the merged nodes, and continues this process until the size of Candstr falls below a specific threshold Hl. At that point, it rebuilds the pool with a new set of candidate operations (line 9) and begins a new round of cluster merges.
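The marginal-losses selection in this phase can be sketched with a min-heap. This is a schematic only: Δ and the structural space savings are abstracted as callables, and the neighborhood recomputation after each merge is elided to a comment:

```python
# Greedy marginal-loss selection: keep candidate merges in a min-heap
# keyed on Δ / space-saved, and apply the cheapest until the structural
# budget is met (a schematic; delta(m) and saved(m) are supplied).

import heapq

def greedy_merge(candidates, delta, saved, budget, size):
    """candidates: merge ops; delta(m): loss of op m; saved(m): its
    structural space savings; size: current structural size."""
    heap = [(delta(m) / saved(m), i, m)
            for i, m in enumerate(candidates) if saved(m) > 0]
    heapq.heapify(heap)
    applied = []
    while heap and size > budget:
        _, _, m = heapq.heappop(heap)
        applied.append(m)
        size -= saved(m)
        # A full implementation would recompute marginal losses for
        # operations in the neighborhood of the merged nodes here.
    return applied, size
```

With three candidate ops of losses 3, 1, 2 and unit savings each, shrinking a size-5 synopsis to a budget of 3 applies the two cheapest ops, in loss order.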

Procedure build_pool(S, Hm, l)
Input: XCLUSTER S; Max heap size Hm; level l
Output: The pool Candstr of candidate merges
begin
1.  Candstr := ∅
2.  for each (u, v) : u, v ∈ S do
3.    if label(u) = label(v) ∧ level(u), level(v) ≤ l then
4.      m ← (merge operation on u, v)
5.      Candstr.push(m)
6.      if |Candstr| > Hm then
7.        Candstr −= { m : m maximizes Δ(S, S′)/(|S| − |S′|) }
8.      endif
9.    endif
10. end
11. return Candstr
end

Figure 6. Function build_pool.

The second technique concerns the initialization of the pool with candidate merge operations (function build_pool, shown in Figure 6). Clearly, the number of candidate merges grows quadratically with the number of synopsis nodes, and an exhaustive exploration of all possible operations may be prohibitively expensive (particularly during the first steps of the construction process, which operate on the large reference synopsis). To address this issue, we employ a heuristic that considers merge operations in a bottom-up fashion, starting from the leaf nodes of the summary and gradually moving closer to the root. More concretely, each synopsis node is assigned to a level based on the shortest outgoing path that leads to a leaf descendant. The key idea is that a merge of two nodes at level l + 1 is more likely to have a low Δ if their respective children have been merged at level l (this matches the intuition that two clusters are similar, and hence can be merged, if they point to similar children). The first initialization of the pool considers merges among nodes at levels 0 (leaves) and 1 (parents of leaves); when the pool needs to be replenished with new candidate operations, the algorithm considers merges up to level maxlevel + 1, where maxlevel is the maximum level of a newly created node from the previous pool. This heuristic is based again on the same intuition, namely, that merges at level l may increase the effectiveness of merges at level l + 1 (which holds the parents of the merged nodes).
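The level assignment driving this bottom-up heuristic (length of the shortest outgoing path to a leaf descendant) is straightforward to compute. A small sketch, assuming the synopsis is given as a child-adjacency dictionary:

```python
# Level assignment for the bottom-up heuristic: a node's level is the
# length of its shortest outgoing path to a leaf (leaves are level 0).

def levels(children):
    """children: dict node -> list of child nodes (a synopsis graph)."""
    memo = {}
    def level(u):
        if u not in memo:
            kids = children.get(u, [])
            memo[u] = 0 if not kids else 1 + min(level(c) for c in kids)
        return memo[u]
    for u in children:
        level(u)
    return memo

g = {'R': ['A'], 'A': ['B', 'C'], 'B': ['C'], 'C': []}
# C is a leaf (level 0); A reaches leaf C directly, so level(A) = 1
# even though A also has the level-1 child B; R gets level 2.
```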

Value-Summary Compression. The goal of this stage is to compress the value-distribution information of the synopsis within the specified value-storage budget. More concretely, the algorithm maintains a priority queue Candval that contains the minimum marginal-loss value-compression operation m (Section 4.2) for each vsumm(u) in the synopsis. To ensure efficiency, our algorithm only considers candidate operations for a fixed value of b (typically, b = 1 in our experiments). Similar to the previous case, the candidate operations are ranked and applied according to their marginal losses, i.e., the value-structure distance Δ(S, S′) between the current (S) and the updated synopsis (S′), normalized by the savings in storage |S| − |S′| (line 15). Once an operation is applied, the algorithm updates the queue with a new operation for the modified value summary vsumm(u), and repeats the process until the specified budget constraint is met (or the queue becomes empty).

5. XCLUSTER Estimation

Our proposed estimation framework is based on an extension of the estimation algorithm for the structural TREESKETCH synopses. More concretely, XCLUSTER estimation relies on the key concept of a query embedding, that is, a mapping from query nodes to synopsis nodes that satisfies the structural and value-based constraints specified in the query. As an example, Figure 7 shows the embedding of a sample query over a given XCLUSTER synopsis. Each node is annotated with the query variable that it maps to, while the edge-counts represent the average number of descendants per source element (we discuss their computation in the next paragraph). Overall, the set of unique embeddings provides the possible evaluations of Q on the underlying database, and the overall selectivity can thus be approximated as the sum of individual embedding selectivities.

To estimate the selectivity of an embedding, the XCLUSTER algorithm employs the stored statistical information coupled with a generalized Path-Value Independence assumption that essentially de-correlates the path distribution from the distribution of value content. More formally, Path-Value Independence approximates the selectivity of a simple synopsis path u[p]/c as |u| · σ_p(u) · count(u, c), where the fraction σ_p(u) can be readily estimated from the value summary vsumm(u). (Note that we have already hinted at the above formula in our discussion of the distance metric in Section 4.) Based on this approximation, our estimation algorithm traverses the query embedding and combines edge counts and predicate selectivities at each step in order to compute the total number of binding tuples. Returning to our example, consider nodes A : q1 and C : q2. Using path-value independence, the number of descendants in q2 per element in q1 is computed as count(A, B) · count(B, C) · σ_p(C) = 10 · 5 · 0.1, assuming that σ_p(C) = 0.1 based on vsumm(C). Similarly, the total descendant count from A : q1 to Ea : q3 is estimated to be count(A, Da) · count(Da, Ea) = 10. Hence, each element in A : q1 will generate 10 · 5 = 50 binding tuples, as it will be combined with every descendant in variables q2 and q3. Given that the root element in q0 has 10 descendants in A : q1, this brings the total estimated number of binding tuples to 50 · 10 = 500.
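The arithmetic of this example can be checked directly. The edge counts 5 and 2 on the A/Da/Ea path below are an assumption consistent with the stated product of 10; the rest follows the figures as described:

```python
# Embedding-selectivity arithmetic for the Figure 7 example:
# multiply edge counts (and a predicate selectivity) along each
# embedded path, multiply across branches, then scale by the q1 count.

def path_count(edge_counts, selectivity=1.0):
    """Average descendants per source element along one embedded path."""
    c = selectivity
    for n in edge_counts:
        c *= n
    return c

per_q1_q2 = path_count([10, 5], selectivity=0.1)  # A/B/C[p]: 10 * 5 * 0.1 = 5
per_q1_q3 = path_count([5, 2])                    # A/Da/Ea: 5 * 2 = 10
tuples_per_a = per_q1_q2 * per_q1_q3              # branches combine: 50
total = 10 * tuples_per_a                         # 10 A-elements -> 500
```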

[Figure omitted: (a) a sample XCLUSTER synopsis over nodes R, A, B, C, Da, Db, Ea, Eb with edge counts and vsumm(C); (b) a twig query with variables q0–q3 and value predicate p; (c) its embedding, with R : q0 reaching A : q1 with count 10, and A : q1 reaching C : q2 and Ea : q3 with combined counts 5 and 10, respectively.]

Figure 7. (a) XCLUSTER, (b) Query, (c) Embedding.

Clearly, the accuracy of XCLUSTER estimation depends heavily on the validity of Path-Value Independence with respect to the underlying data. This assumption, however, is satisfied if element clusters are tight, i.e., they group together elements that have the same structure and a similar distribution of values. As outlined in Section 4, this is exactly the criterion used by our construction algorithm in order to compress a synopsis within a specific storage budget. This observation forms the key link between the construction process and the proposed estimation framework, and provides an intuitive justification for the effectiveness of our construction algorithm.

6. Experimental Study

In this section, we present results from an empirical study that we have conducted using our novel XCLUSTER synopses over real-life and synthetic XML data sets.

6.1. Methodology

Techniques. We have completed a prototype implementation of the XCLUSTER model that is outlined in Section 3. Our prototype considers the construction of value summaries under specific paths of the underlying XML, and supports histograms, PSTs, and end-biased term histograms as approximation methods for value distributions.

Data Sets. We use two data sets in our evaluation: a subset of the real-life IMDB data set, and the XMark synthetic benchmark data set. The main characteristics of our data sets are summarized in Table 1. The table lists the size of the XML document, the number of elements in each data set, the size of the reference synopsis, and the number of nodes (total, and nodes with value summaries only). As mentioned earlier, the reference synopsis contains value summaries for specific paths only, which are provided as input to the construction algorithm. In our experiments, we included at least one path for each different type of values, for a total of 7 paths for IMDB and 9 for XMark. In both data sets, we observe that the reference synopsis is considerably smaller than the input data but may still be too large for the time and memory constraints of a query optimizer.

Workloads. We evaluate the accuracy of each synopsis on a workload of random positive twig queries, i.e., queries with non-zero selectivity. The workload is generated by sampling twigs from the reference synopsis and attaching random predicates at nodes with values. (The sampling of paths and predicates is biased toward high counts.) Table 2 summarizes the characteristics of our workloads for the two data sets. We note that we have also performed experiments with negative workloads, i.e., queries with zero selectivity. We omit the results, as they have shown that XCLUSTERs consistently yield close-to-zero estimates for all space budgets.

Evaluation Metric. We quantify the accuracy of an XCLUSTER summary with the average absolute relative error of result estimates over the queries of a workload. Given a query q with true result size c and estimated selectivity e, the absolute relative error is computed as |c − e|/max(c, s), where the parameter s represents a sanity bound that essentially equates all low counts with a default count s and thus avoids inordinately high contributions from low-count path expressions. Following previous studies [17, 18], we set this bound to the 10-percentile of the true counts in the workload (i.e., 90% of the path expressions in the workload have a true result size ≥ s).
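For concreteness, the metric above can be transcribed directly (a one-line formula plus the workload average):

```python
# Sanity-bounded absolute relative error, as defined above: the bound s
# damps the contribution of low-count queries, and the workload score is
# the average over all (true_count, estimate) pairs.

def rel_error(true_count, estimate, s):
    return abs(true_count - estimate) / max(true_count, s)

def avg_workload_error(pairs, s):
    return sum(rel_error(c, e, s) for c, e in pairs) / len(pairs)

# With sanity bound s = 10, a query with true count 2 and estimate 0
# contributes |2 - 0| / max(2, 10) = 0.2 rather than a 100% error.
```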

Table 1. Data Set Characteristics.

           File Size (MB)   # Elements   Ref. Size (KB)   # Nodes: Value/Total
IMDB       7.1              236822       473448           2037 / 3800
XMark      10               206130       890745           3593 / 16446

Table 2. Workload Characteristics.

           Avg. Result Size
           Struct     Pred
IMDB       6727       123
XMark      286341     1005

6.2. Experimental Results

In this section, we present the results of our experimental study for evaluating the effectiveness of our novel

XCLUSTER summaries. In all the experiments that we present, we vary the structural summarization budget from 0KB to 50KB while keeping the value summarization budget fixed at 150KB. (Hence, the total summary size varies from 150KB to 200KB.) We have empirically verified that these settings provide a good balance between structural and value-based summarization for the two data sets that we have used. As we have discussed in Section 4, the automated allocation of a total space budget remains an interesting problem that we intend to investigate further in our future research. We note that a structural space budget of 0KB represents the smallest possible structural summary, which clusters elements based solely on their tags. In all experiments, we set the maximum and minimum size of the candidate merge pool to Hm = 10000 and Hl = 5000, respectively.

Figure 8 shows the average estimation error as a function of the structural budget size, for the two data sets IMDB and XMark. For each data set, the plot depicts the overall estimation error (Overall), the estimation error for queries with predicates on different types of value content (Numeric, Text, String), and queries without predicates (Struct). We discuss these results in more detail below.

Overall Estimation Error. Overall, the results indicate that our novel XCLUSTER synopses constitute a very effective technique for estimating the selectivity of complex twig queries over structured XML content. For both IMDB and XMark, the overall estimation error falls below 10% for a modest total budget of 200KB, which represents a tiny fraction of the original data set size (Table 1). This level of performance is very promising, considering the complexity of our workload: multi-variable twig queries, with value predicates on different types of XML content. We also observe that the final error is considerably lower than the starting error of the smallest structural summary (73% for IMDB and 63% for XMark), which clearly demonstrates the effectiveness of our novel structure-value clustering metric in capturing the key correlations between the path and value distribution of the underlying data.

Estimation Error for Value Predicates. The estimation error for individual classes of value predicates follows the same decreasing trend as the previous case, demonstrating again the effectiveness of our summarization mechanism. A notable exception is the considerably increased error of more than 50% for queries with TEXT predicates over the XMark data set. This result, however, is an artifact of our test workload on XMark. More concretely, the XMark TEXT predicates have very low selectivities, thus leading to an artificially high relative error even though the absolute error is low. To verify this, Figure 9 lists the average absolute error for low-count queries over the two data sets, when the synopsis size is fixed at 200KB. (A query is included in these results if its true selectivity is below the sanity bound s.) As the results indicate, TEXT queries on XMark have an average absolute error of 1.09 tuples per query, which, combined with the average true selectivity of 3 tuples, yields the artificially high relative error.

Another interesting point concerns the estimation error for numeric predicates in the IMDB data set. More concretely, we observe an increase in the average error when the structural storage budget is varied from 35KB to 40KB. In this case, the compressed structure in the 35KB synopsis effectively leads to a lower number of nodes with numeric values, which are however assigned the same budget of 150KB for value-based summarization. This essentially results in an increased number of buckets per numeric histogram, thus yielding lower estimation errors for the smaller synopsis.

Error for Structural Queries. Finally, it is interesting to note the effectiveness of our novel localized clustering metric in the structural summarization of XML data. For both data sets, the estimation error for structural queries remains well below 5% for modest structural budgets (10KB–50KB), and is comparable to the original TREESKETCH summaries [18], which target specifically the structural summarization problem. Moreover, the results indicate that our localized Δ metric is equally effective as the global TREESKETCH clustering metric, which quantifies the differences between successive compression steps based on a count-stable summary and hence requires repeated accesses to the (potentially large) reference synopsis.

[Figure omitted: plots of Avg. Rel. Error (%) vs. Synopsis Size (KB), over 150–200 KB, for the Text, String, Numeric, Struct, and Overall query classes on each data set.]

Figure 8. XCLUSTER relative estimation error for complex twig queries: (a) IMDB, (b) XMark.

              Avg. Absolute Error
              IMDB      XMark
Numeric       0.015     0
String        5.12      0.5
Text          0.18      1.09

Figure 9. Absolute estimation error for low-count queries.

7. Conclusions

The accurate summarization of XML data with heterogeneous value content remains a critical problem in the effective optimization of real-world XML queries. In this paper, we propose the novel class of XCLUSTER synopses that enable selectivity estimates for complex twig queries with predicates on numeric (range queries), string (substring queries), and textual content (IR queries). We define the XCLUSTER model and develop a systematic construction algorithm for building accurate summaries within a specific space budget. Our results on real-life and synthetic data sets verify the effectiveness of our approach.

References

[1] A. Aboulnaga, A. R. Alameldeen, and J. F. Naughton. Estimating the Selectivity of XML Path Expressions for Internet Scale Applications. In VLDB, 2001.

[2] S. Amer-Yahia, C. Botev, S. Buxton, P. Case, J. Doerne, D. McBeath, M. Rys, and J. Shanmugasundaram. XQuery 1.0 and XPath 2.0 Full-Text. W3C Working Draft.

[3] S. Amer-Yahia, L. V. S. Lakshmanan, and S. Pandit. FleXPath: Flexible Structure and Full-Text Querying for XML. In ACM SIGMOD, 2004.

[4] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press, Addison-Wesley Publishing Company, 1999.

[5] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic Clustering of the Web. In WWW, 1997.

[6] S. Chaudhuri, V. Ganti, and L. Gravano. Selectivity Estimation for String Predicates: Overcoming the Underestimation Problem. In ICDE, 2004.

[7] Z. Chen, H. V. Jagadish, F. Korn, N. Koudas, S. Muthukrishnan, R. Ng, and D. Srivastava. Counting Twig Matches in a Tree. In ICDE, 2001.

[8] S. Cohen, J. Mamou, Y. Kanza, and Y. Sagiv. XSEarch: A Semantic Search Engine for XML. In VLDB, 2003.

[9] M. Franz and J. S. McCarley. How Many Bits are Needed to Store Term Frequencies? In SIGIR, 2002.

[10] J. Freire, J. R. Haritsa, M. Ramanath, P. Roy, and J. Simeon. StatiX: Making XML Count. In ACM SIGMOD, 2002.

[11] H. Jagadish, R. T. Ng, and D. Srivastava. Substring Selectivity Estimation. In PODS, 1999.

[12] X. He, D. Cai, H. Liu, and W. Y. Ma. Locality Preserving Indexing for Document Representation. In SIGIR, 2004.

[13] L. Lim, M. Wang, S. Padmanabhan, J. Vitter, and R. Parr. XPathLearner: An On-Line Self-Tuning Markov Histogram for XML Path Selectivity Estimation. In VLDB, 2002.

[14] L. Lim, M. Wang, and J. Vitter. CXHist: An On-line Classification-Based Histogram for XML String Selectivity Estimation. In VLDB, 2005.

[15] R. J. Lipton, J. F. Naughton, D. A. Schneider, and S. Seshadri. Efficient Sampling Strategies for Relational Database Operations. Theoretical Comput. Sci., 116(1 & 2), 1993.

[16] Y. Matias, J. S. Vitter, and M. Wang. Wavelet-Based Histograms for Selectivity Estimation. In ACM SIGMOD, 1998.

[17] N. Polyzotis and M. Garofalakis. Structure and Value Synopses for XML Data Graphs. In VLDB, 2002.

[18] N. Polyzotis, M. Garofalakis, and Y. Ioannidis. Approximate XML Query Answers. In ACM SIGMOD, 2004.

[19] N. Polyzotis, M. Garofalakis, and Y. Ioannidis. Selectivity Estimation for XML Twigs. In ICDE, 2004.

[20] V. Poosala and Y. E. Ioannidis. Selectivity Estimation Without the Attribute Value Independence Assumption. In VLDB, 1997.

[21] V. Poosala, Y. E. Ioannidis, P. J. Haas, and E. J. Shekita. Improved Histograms for Selectivity Estimation of Range Predicates. In ACM SIGMOD, 1996.

[22] C. M. Procopiuc. Geometric Techniques for Clustering: Theory and Practice. PhD thesis, Duke Univ., 2001.

[23] T. Sakai and K. Sparck-Jones. Generic Summaries for Indexing in Information Retrieval. In SIGIR, 2001.

[24] J. S. Vitter and M. Wang. Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets. In ACM SIGMOD, 1999.

[25] W. Wang, H. Jiang, H. Lu, and J. X. Yu. Containment Join Size Estimation: Models and Methods. In ACM SIGMOD, 2003.

[26] Y. Wu, J. M. Patel, and H. Jagadish. Estimating Answer Sizes for XML Queries. In EDBT, 2002.

[27] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. In ACM SIGMOD, 1996.


