
IEEE 22nd International Conference on Data Engineering (ICDE'06), Atlanta, GA, USA (April 3-7, 2006): XCluster Synopses for Structured XML Content







  • XCluster Synopses for Structured XML Content

    Neoklis Polyzotis, Univ. of California, Santa Cruz, alkis@cs.ucsc.edu

    Minos Garofalakis, Intel Research Berkeley



    We tackle the difficult problem of summarizing the path/branching structure and value content of an XML database that comprises both numeric and textual values. We introduce a novel XML-summarization model, termed XCLUSTERs, that enables accurate selectivity estimates for the class of twig queries with numeric-range, substring, and textual IR predicates over the content of XML elements. In a nutshell, an XCLUSTER synopsis represents an effective clustering of XML elements based on both their structural and value-based characteristics. By leveraging techniques for summarizing XML-document structure as well as numeric and textual data distributions, our XCLUSTER model provides the first known unified framework for handling path/branching structure and different types of element values. We detail the XCLUSTER model, and develop a systematic framework for the construction of effective XCLUSTER summaries within a specified storage budget. Experimental results on synthetic and real-life data verify the effectiveness of our XCLUSTER synopses, clearly demonstrating their ability to accurately summarize XML databases with mixed-value content. To the best of our knowledge, ours is the first work to address the summarization problem for structured XML content in its full generality.

    1. Introduction

    The Extensible Mark-up Language (XML) has rapidly evolved to an emerging standard for large-scale data exchange and integration over the Internet. Being self-describing and hierarchical in nature, XML provides a suitable data model that can adapt to several diverse domains and hence enable applications to query effectively the vast amount of information available on the Web.

    Within the realm of XML query processing, XML summarization has emerged as an important component for the effective implementation of high-level declarative queries. In brief, a concise XML summary, or synopsis, captures (in limited space) the key statistical characteristics of the underlying data and essentially represents a highly-compressed, approximate version of the XML database. By executing a query over the synopsis, the optimizer can efficiently obtain selectivity estimates for different query fragments and thus derive the cost factors of candidate physical execution plans.

    Supported in part by NSF Grant IIS-0447966

    One of the key challenges in this important problem stems from the inherent complexity of XML data. More specifically, the information content of a semi-structured data store is encoded in both the structure of the XML tree and the values under different elements. Moreover, the content of XML elements is inherently heterogeneous, comprising different types of values, e.g., integers, strings, or free text, that can be queried with different classes of predicates. As an example, an application may query an XML database with bibliographic information using the following path expression¹: //paper[year>2000][abstract ftcontains(synopsis, XML)]/title[contains(Tree)], which will select all titles of papers that were published after 2000, if their abstracts mention the terms synopsis and XML and their title contains the substring Tree. To enable low-error selectivity estimates for such queries, an XML summary clearly needs to capture the key correlations between and across the underlying path structure and value content, and provide accurate approximations for different types of value distributions. Given that real-life XML data sets contain highly heterogeneous content, it becomes obvious that realizing this important and challenging goal will provide crucial support for the effective optimization of XML queries in practice.
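To make the example query concrete, the following minimal Python sketch evaluates it exactly over a tiny, made-up bibliographic document. The document contents, the helper names `ft_contains` and `matching_titles`, and the whole-word reading of ftcontains are all assumptions for illustration. The sketch also assumes one year, abstract, and title per paper. It computes the true result that a synopsis would only approximate; it is not an estimator.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy document; element names follow the example query.
DOC = """
<bib>
  <paper><year>2004</year><title>Tree Synopses</title>
    <abstract>A synopsis for XML data</abstract></paper>
  <paper><year>1999</year><title>Tree Indexes</title>
    <abstract>An XML synopsis study</abstract></paper>
  <paper><year>2005</year><title>Graph Sketches</title>
    <abstract>XML synopsis construction</abstract></paper>
  <paper><year>2003</year><title>Tree Patterns</title>
    <abstract>Query processing for XML</abstract></paper>
</bib>
"""

def ft_contains(text, terms):
    """Assumed keyword semantics: every term occurs as a whole word."""
    words = set(text.lower().split())
    return all(t.lower() in words for t in terms)

def matching_titles(root):
    """Titles selected by the example twig query, evaluated exactly."""
    out = []
    for paper in root.iter("paper"):
        year = int(paper.findtext("year"))
        abstract = paper.findtext("abstract", default="")
        title = paper.findtext("title", default="")
        if (year > 2000                                     # numeric range predicate
                and ft_contains(abstract, ["synopsis", "XML"])  # IR-style predicate
                and "Tree" in title):                       # substring predicate
            out.append(title)
    return out

root = ET.fromstring(DOC)
titles = matching_titles(root)
print(titles, len(titles))
```

Only the first paper satisfies all three predicates at once. The other three each fail exactly one predicate, which is why per-predicate statistics alone can mislead an estimator when structure and values are correlated.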

    Related Work². Summarizing a large XML data set for the purpose of estimating the selectivity of complex queries with value predicates is a substantially different and more difficult problem than that of constructing synopses for flat, relational data (e.g., [20, 24]). Recent research studies have targeted specific variants of the XML summarization problem, namely, structure-only summarization [1, 18, 25], or structure and value summarization only for numeric values [10, 13, 17, 19, 26]. Correlated Suffix Trees (CSTs) [7] and CXHist [14] are recently-proposed techniques that tackle the problem of XML selectivity estimation for substring predicates. CSTs, however, take a straightforward approach, simply treating string values as an extension of the XML structure; on the other hand, CXHist focuses on the simple case of fully-specified linear XPath expressions. It is not at all clear if these techniques can be extended to the more general problem of twig queries with predicates on heterogeneous value content.

    ¹ In this example, we use the ftcontains operator from the Full-Text extensions to XPath [2].

    ² Due to space constraints, a more detailed discussion of related work is deferred to the full version of this paper.

    Proceedings of the 22nd International Conference on Data Engineering (ICDE'06) 8-7695-2570-9/06 $20.00 2006 IEEE

    Contributions. In this paper, we address the challenging and important problem of XML summarization in the presence of heterogeneous value content. We propose a novel class of XML synopses, termed XCLUSTERs, that capture (in limited space) the key characteristics of the path and value distribution of an XML database and enable selectivity estimates for twig queries with complex path expressions and predicates on element content. In sharp contrast to previous work, our proposed XCLUSTER model provides a unified summarization framework that enables a single XML synopsis to effectively support twig queries with predicates on numeric content (range queries), string content (substring queries), and/or textual content (IR-style queries). To the best of our knowledge, ours is the first attempt to explore the key problem of XML summarization in the context of heterogeneous element values. The main contributions of our work can be summarized as follows.

    XCLUSTER Summarization Model. Our proposed XCLUSTER synopses rely on a clean, yet powerful model of generalized structure-value clusters, a unified, clustering-based framework that can effectively capture the key correlations between and across structure and values of different types. To handle value-based approximations, our framework employs well-known techniques for numeric and string values, and introduces the class of end-biased term histograms for summarizing the distribution of unique terms within textual XML content.
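The text above names end-biased term histograms without defining them here. As a rough illustration, the Python sketch below follows the classic end-biased idea from relational histograms: keep exact counts for the k most frequent terms and one shared average for all remaining terms. The function names, the dictionary layout, and the parameter `k` are assumptions, and the paper's actual structure may differ.

```python
from collections import Counter

def build_end_biased(term_lists, k):
    """Exact counts for the k most frequent terms, one average for the rest.

    term_lists holds one list of terms per TEXT element; a term is counted
    once per element, matching the Boolean (present/absent) view of text.
    """
    counts = Counter(t for terms in term_lists for t in set(terms))
    top = dict(counts.most_common(k))
    rest = [c for t, c in counts.items() if t not in top]
    avg_rest = sum(rest) / len(rest) if rest else 0.0
    return {"n": len(term_lists), "top": top, "avg_rest": avg_rest}

def estimate_fraction(hist, term):
    """Estimated fraction of elements whose text contains `term`."""
    if hist["n"] == 0:
        return 0.0
    # Terms outside the exact part all share the same average count.
    return hist["top"].get(term, hist["avg_rest"]) / hist["n"]

texts = [
    ["xml", "synopsis"],
    ["xml", "tree"],
    ["xml", "synopsis", "query"],
    ["xml"],
]
hist = build_end_biased(texts, k=2)
print(estimate_fraction(hist, "xml"), estimate_fraction(hist, "tree"))
```

One limitation of this simplified version is that a term absent from the data still receives the average of the infrequent terms, because the sketch does not record which terms fall into the shared bucket.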

    XCLUSTER Construction Algorithm. We introduce a set of compression operations for reducing the size of an XCLUSTER synopsis and develop a systematic metric for quantifying the effect of a compression step on the accuracy of the XML summary. Our proposed metric captures the impact on the structure-value clustering of the synopsis by taking into account the localized structural and value-based characteristics of the compressed area of the summary. Based on this framework, we propose an efficient, bottom-up construction algorithm that builds an effective XCLUSTER synopsis for a specific space budget by applying carefully selected compression steps on an initial detailed summary.
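The general shape of such a budget-driven, bottom-up construction can be sketched as a greedy loop: start from a detailed summary and repeatedly apply the cheapest compression step until the budget is met. The Python sketch below is only a stand-in under loud assumptions. Clusters are plain lists of numbers, the only compression step is merging two adjacent clusters, the budget is a cluster count rather than bytes, and the cost is the increase in sum of squared error. None of this is the paper's metric or operation set.

```python
def sse(values):
    """Sum of squared deviations from the mean (stand-in error measure)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def compress(clusters, budget):
    """Greedily merge adjacent clusters until at most `budget` remain."""
    if budget < 1:
        raise ValueError("budget must be at least 1")
    clusters = [list(c) for c in clusters]  # work on a copy
    while len(clusters) > budget:
        best_i, best_cost = None, None
        for i in range(len(clusters) - 1):
            merged = clusters[i] + clusters[i + 1]
            # Cost of a step: how much the error grows if we apply it.
            cost = sse(merged) - sse(clusters[i]) - sse(clusters[i + 1])
            if best_cost is None or cost < best_cost:
                best_i, best_cost = i, cost
        clusters[best_i:best_i + 2] = [clusters[best_i] + clusters[best_i + 1]]
    return clusters

detailed = [[1], [2], [10], [11], [50]]
summary = compress(detailed, budget=3)
print(summary)
```

The point of the sketch is the control flow: each iteration evaluates every candidate step locally and applies the one that damages accuracy least, so the summary shrinks while the error grows as slowly as the greedy choice allows.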

    Experimental Study Verifying the Effectiveness of XCLUSTERs. We validate our approach experimentally with a preliminary study on real-life and synthetic data sets. Our results demonstrate that concise XCLUSTERs constitute an effective summarization technique for XML data with heterogeneous content, enabling accurate selectivity estimates for complex twig queries with different classes of value predicates.

    2. Preliminaries

    Data Model. Following common practice, we model an XML document as a large, node-labeled tree $T(V, E)$. Each node $u \in V$ corresponds to an XML element and is characterized by a label (or, tag) assigned from some alphabet of string literals, that captures the element's semantics. Edges $(e_i, e_j) \in E$ are used to capture the containment of (sub)element $e_j$ under $e_i$ in the database. (We use label($e_i$), children($e_i$) to denote the label and set of child nodes for element node $e_i \in V$.) In addition, each element node $e_i$ can potentially also contain a value of a certain type (denoted by value($e_i$)); we assume the existence of a mapping function type from elements to a set of data types, such that type($e_i$) is the data type of value value($e_i$). (Elements with no values are mapped to a special null data type.) Our study considers the following set of possible data types for XML-element values:

    • NUMERIC: Captures numeric element values; for instance, in a bibliographic database, NUMERIC values would include book prices, publication years, and so on. Following the usual conventions for numeric database attributes, we assume the NUMERIC values range in an integer domain $\{0 \ldots M-1\}$.

    • STRING: Captures (short) string values; in a bibliographic database, these would include author/publisher names and addresses, book titles, and so on.

    • TEXT: Captures free-text element values; in our bibliographic database example, these would include book forewords and summaries, paper abstracts, and so on. Such textual values need to support an IR-style, full-text querying paradigm based on keyword/index-term search [3, 8]. Based on the traditional set-theoretic, Boolean model of IR, TEXT values are essentially Boolean vectors over an underlying dictionary of terms (where the ith entry of the vector is 1 or 0 depending on whether the ith term appears in the free-text data or not).³
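The data model above maps naturally onto a small in-memory structure. The Python sketch below is one possible rendering, not the authors' implementation. The class and field names (`Element`, `ValueType`, `vtype`), the whitespace tokenization, and the three-term dictionary are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Union

class ValueType(Enum):
    NULL = "null"        # element carries no value
    NUMERIC = "numeric"  # integer in {0 ... M-1}
    STRING = "string"    # short string, queried by substring
    TEXT = "text"        # free text, queried by keyword

@dataclass
class Element:
    label: str                                   # label(e)
    vtype: ValueType = ValueType.NULL            # type(e)
    value: Optional[Union[int, str]] = None      # value(e)
    children: List["Element"] = field(default_factory=list)  # children(e)

def text_to_vector(text, dictionary):
    """Boolean-model encoding: entry i is 1 iff dictionary[i] occurs in text."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in dictionary]

paper = Element("paper", children=[
    Element("year", ValueType.NUMERIC, 2004),
    Element("title", ValueType.STRING, "Tree Synopses"),
    Element("abstract", ValueType.TEXT, "a synopsis for xml data"),
])
dictionary = ["synopsis", "tree", "xml"]
vec = text_to_vector(paper.children[2].value, dictionary)
print(vec)
```

Representing TEXT values as 0/1 vectors discards term order and term counts, which is exactly what the Boolean model of IR assumes.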

    As an example, Figure 1 depicts a sample XML data tree containing bibliographic data. The document consists of author elements, each comprising a name, and several paper and book sub-elements. Each paper and bookco