[IEEE 22nd International Conference on Data Engineering (ICDE'06) - Atlanta, GA, USA (2006.04.3-2006.04.7)] 22nd International Conference on Data Engineering (ICDE'06) - XCluster Synopses for Structured XML Content


  • XCluster Synopses for Structured XML Content

    Neoklis Polyzotis, Univ. of California, Santa Cruz, alkis@cs.ucsc.edu

    Minos Garofalakis, Intel Research Berkeley, minos.garofalakis@intel.com

    Abstract

    We tackle the difficult problem of summarizing the path/branching structure and value content of an XML database that comprises both numeric and textual values. We introduce a novel XML-summarization model, termed XCLUSTERs, that enables accurate selectivity estimates for the class of twig queries with numeric-range, substring, and textual IR predicates over the content of XML elements. In a nutshell, an XCLUSTER synopsis represents an effective clustering of XML elements based on both their structural and value-based characteristics. By leveraging techniques for summarizing XML-document structure as well as numeric and textual data distributions, our XCLUSTER model provides the first known unified framework for handling path/branching structure and different types of element values. We detail the XCLUSTER model, and develop a systematic framework for the construction of effective XCLUSTER summaries within a specified storage budget. Experimental results on synthetic and real-life data verify the effectiveness of our XCLUSTER synopses, clearly demonstrating their ability to accurately summarize XML databases with mixed-value content. To the best of our knowledge, ours is the first work to address the summarization problem for structured XML content in its full generality.

    1. Introduction

    The Extensible Mark-up Language (XML) has rapidly evolved to an emerging standard for large-scale data exchange and integration over the Internet. Being self-describing and hierarchical in nature, XML provides a suitable data model that can adapt to several diverse domains and hence enable applications to query effectively the vast amount of information available on the Web.

    Within the realm of XML query processing, XML summarization has emerged as an important component for the effective implementation of high-level declarative queries. In brief, a concise XML summary, or synopsis, captures (in limited space) the key statistical characteristics of the underlying data and essentially represents a highly-compressed, approximate version of the XML database. By executing a query over the synopsis, the optimizer can efficiently obtain selectivity estimates for different query fragments and thus derive the cost factors of candidate physical execution plans.

    Supported in part by NSF Grant IIS-0447966.

    One of the key challenges in this important problem stems from the inherent complexity of XML data. More specifically, the information content of a semi-structured data store is encoded in both the structure of the XML tree and the values under different elements. Moreover, the content of XML elements is inherently heterogeneous, comprising different types of values, e.g., integers, strings, or free text, that can be queried with different classes of predicates. As an example, an application may query an XML database with bibliographic information using the following path expression¹: //paper[year>2000][abstract ftcontains(synopsis, XML)]/title[contains(Tree)], which will select all titles of papers that were published after 2000, if their abstracts mention the terms "synopsis" and "XML" and their title contains the substring "Tree". To enable low-error selectivity estimates for such queries, an XML summary clearly needs to capture the key correlations between and across the underlying path structure and value content, and provide accurate approximations for different types of value distributions. Given that real-life XML data sets contain highly heterogeneous content, it becomes obvious that realizing this important and challenging goal will provide crucial support for the effective optimization of XML queries in practice.

    Related Work². Summarizing a large XML data set for the purpose of estimating the selectivity of complex queries with value predicates is a substantially different and more difficult problem than that of constructing synopses for flat, relational data (e.g., [20, 24]). Recent research studies have targeted specific variants of the XML summarization problem, namely, structure-only summarization [1, 18, 25], or structure and value summarization only for numeric values [10, 13, 17, 19, 26]. Correlated Suffix Trees (CSTs) [7] and CXHist [14] are recently-proposed techniques that tackle the problem of XML selectivity estimation for substring predicates. CSTs, however, take a straightforward approach, simply treating string values as an extension of the XML structure; on the other hand, CXHist focuses on the simple case of fully-specified linear XPath expressions. It is not at all clear if these techniques can be extended to the more general problem of twig queries with predicates on heterogeneous value content.

    ¹In this example, we use the ftcontains operator from the Full-Text extensions to XPath [2].

    ²Due to space constraints, a more detailed discussion of related work is deferred to the full version of this paper.

    Proceedings of the 22nd International Conference on Data Engineering (ICDE06) 8-7695-2570-9/06 $20.00 2006 IEEE

    Contributions. In this paper, we address the challenging and important problem of XML summarization in the presence of heterogeneous value content. We propose a novel class of XML synopses, termed XCLUSTERs, that capture (in limited space) the key characteristics of the path and value distribution of an XML database and enable selectivity estimates for twig queries with complex path expressions and predicates on element content. In sharp contrast to previous work, our proposed XCLUSTER model provides a unified summarization framework that enables a single XML synopsis to effectively support twig queries with predicates on numeric content (range queries), string content (substring queries), and/or textual content (IR-style queries). To the best of our knowledge, ours is the first attempt to explore the key problem of XML summarization in the context of heterogeneous element values. The main contributions of our work can be summarized as follows.

    XCLUSTER Summarization Model. Our proposed XCLUSTER synopses rely on a clean, yet powerful model of generalized structure-value clusters, a unified, clustering-based framework that can effectively capture the key correlations between and across structure and values of different types. To handle value-based approximations, our framework employs well-known techniques for numeric and string values, and introduces the class of end-biased term histograms for summarizing the distribution of unique terms within textual XML content.

    XCLUSTER Construction Algorithm. We introduce a set of compression operations for reducing the size of an XCLUSTER synopsis and develop a systematic metric for quantifying the effect of a compression step on the accuracy of the XML summary. Our proposed metric captures the impact on the structure-value clustering of the synopsis by taking into account the localized structural and value-based characteristics of the compressed area of the summary. Based on this framework, we propose an efficient, bottom-up construction algorithm that builds an effective XCLUSTER synopsis for a specific space budget by applying carefully selected compression steps on an initial detailed summary.

    Experimental Study Verifying the Effectiveness of XCLUSTERs. We validate our approach experimentally with a preliminary study on real-life and synthetic data sets. Our results demonstrate that concise XCLUSTERs constitute an effective summarization technique for XML data with heterogeneous content, enabling accurate selectivity estimates for complex twig queries with different classes of value predicates.

    2. Preliminaries

    Data Model. Following common practice, we model an XML document as a large, node-labeled tree T(V, E). Each node u ∈ V corresponds to an XML element and is characterized by a label (or, tag) assigned from some alphabet of string literals, that captures the element's semantics. Edges (e_i, e_j) ∈ E are used to capture the containment of (sub)element e_j under e_i in the database. (We use label(e_i), children(e_i) to denote the label and set of child nodes for element node e_i ∈ V.) In addition, each element node e_i can potentially also contain a value of a certain type (denoted by value(e_i)); we assume the existence of a mapping function type from elements to a set of data types, such that type(e_i) is the data type of value value(e_i). (Elements with no values are mapped to a special null data type.) Our study considers the following set of possible data types for XML-element values:

    NUMERIC: Captures numeric element values; for instance, in a bibliographic database, NUMERIC values would include book prices, publication years, and so on. Following the usual conventions for numeric database attributes, we assume the NUMERIC values range in an integer domain {0, ..., M − 1}.

    STRING: Captures (short) string values; in a bibliographic database, these would include author/publisher names and addresses, book titles, and so on.

    TEXT: Captures free-text element values; in our bibliographic database example, these would include book forewords and summaries, paper abstracts, and so on. Such textual values need to support an IR-style, full-text querying paradigm based on keyword/index-term search [3, 8]. Based on the traditional set-theoretic, Boolean model of IR, TEXT values are essentially Boolean vectors over an underlying dictionary of terms (where the i-th entry of the vector is 1 or 0 depending on whether the i-th term appears in the free-text data or not).³
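    The data model above can be illustrated with a minimal Python sketch. The class and function names (`Element`, `ValueType`, `text_value`) and the sample values are hypothetical and chosen only for illustration; a TEXT value is stored as the set of terms whose entry in the Boolean vector would be 1, which is an equivalent encoding assumed here for brevity.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import FrozenSet, List, Union


class ValueType(Enum):
    NULL = "null"        # element carries no value
    NUMERIC = "numeric"  # integer in {0, ..., M - 1}
    STRING = "string"    # short string
    TEXT = "text"        # free text under the Boolean IR model


@dataclass
class Element:
    label: str                                            # label(e)
    value: Union[None, int, str, FrozenSet[str]] = None   # value(e)
    children: List["Element"] = field(default_factory=list)

    @property
    def type(self) -> ValueType:
        """type(e): the data type of value(e), inferred from the Python type."""
        if self.value is None:
            return ValueType.NULL
        if isinstance(self.value, int):
            return ValueType.NUMERIC
        if isinstance(self.value, str):
            return ValueType.STRING
        return ValueType.TEXT


def text_value(raw: str) -> FrozenSet[str]:
    """Boolean-model TEXT value: the set of distinct lower-cased terms."""
    return frozenset(raw.lower().split())


# A hypothetical paper element with one value of each type.
paper = Element("paper", children=[
    Element("year", 2002),
    Element("title", "Tree Synopses"),
    Element("abstract", text_value("a synopsis for XML data")),
])
```

    In this sketch the tree edges are implicit in the `children` lists, so children(e) is simply `e.children`.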

    As an example, Figure 1 depicts a sample XML data tree containing bibliographic data. The document consists of author elements, each comprising a name, and several paper and book sub-elements. Each paper and book comprises a (STRING-valued) title, a (NUMERIC-valued) year of publication, as well as (TEXT-valued) keywords, abstract, and/or foreword sub-elements. Note that element nodes in the tree are named with the first letter of the element's tag plus a unique identifier.

    [Figure 1. Example XML document. Node labels: d0, a1, a11, p2, n6{...}, p7, n12{...}, b13, y3{2000}, t4{Counting...}, k5{XML, Summary, ..}, y8{2002}, t9{Holistic...}, ab10{XML employs a...}, y14{2002}, t15{Database...}, f16{Database systems have...}]

    [Figure 2. Example query. Query nodes: q0, q1, q2, q3. Edge paths: .//p[y>2000], ./t[contains(Tree)], ./abs[ftcontains(synopsis)]]

    ³In future work, we intend to explore more flexible, Vector-Space IR models [4] of representing and querying TEXT values in XML documents, and their impact on our summarization framework.

    We believe that the above set of value types adequately captures the bulk of real-world XML content. Textual information (i.e., STRING and TEXT values), in particular, is an integral part of real-life XML documents; this is clearly demonstrated by numerous recent research as well as standardization efforts that attempt to integrate XML query languages and query processing with substring and term-based search models [3, 8, 2]. Our work, however, is the first to consider the implications of different types of numeric and textual information on the difficult problem of effective XML summarization [10, 17].

    Query Model. The focus of this paper is twig queries with value predicates. More specifically, a twig query Q is a node- and edge-labeled tree Q(V_Q, E_Q) where each node q_i ∈ V_Q represents a query variable that is bound, during query evaluation, to a set of elements from the input document (we assume that q_0 is the root of the query and is always mapped to the root of the document). Figure 2 shows an example twig query over the sample document of Figure 1. An edge (q_i, q_j) ∈ E_Q denotes a structural constraint between the elements of the source and target variable, specified by an XPath expression edge-path(q_i, q_j). In our work, we focus on XPath expressions that involve the child and descendant axes, wildcards, and optional predicates on path branches and element values. Conceptually, the evaluation of Q generates all possible assignments of elements to query variables, such that both (a) the structural constraints (as specified by the edge labels), and (b) the value constraints (as specified by the value predicates attached to query nodes), are satisfied. This set of possible assignments constitutes the set of binding tuples, and its cardinality is defined as the selectivity s(Q) of the query.
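    To make the notion of binding tuples concrete, the following Python sketch counts them exactly on a tiny document. It is a simplified illustration rather than the evaluation strategy of any particular system: the names `Elem`, `QNode`, and `count_bindings` are hypothetical, each edge path is restricted to a single child or descendant step with a label test (no wildcards), and branch predicates are modeled as existential `filters`. The document and its values are made up.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Elem:
    label: str
    value: object = None
    children: List["Elem"] = field(default_factory=list)


@dataclass
class QNode:
    label: str                                        # label test for bound elements
    axis: str = "child"                               # "child" or "descendant"
    pred: Optional[Callable[[object], bool]] = None   # value predicate on the element
    filters: List["QNode"] = field(default_factory=list)   # existential branch predicates
    children: List["QNode"] = field(default_factory=list)  # child query variables


def descendants(e):
    for c in e.children:
        yield c
        yield from descendants(c)


def candidates(e, q):
    """Elements reachable from e that can be bound to query variable q."""
    pool = e.children if q.axis == "child" else descendants(e)
    return [c for c in pool
            if c.label == q.label
            and (q.pred is None or q.pred(c.value))
            and all(candidates(c, f) for f in q.filters)]


def count_bindings(e, q):
    """Number of binding tuples for the sub-twig rooted at q when q is bound to e."""
    total = 1
    for qc in q.children:
        total *= sum(count_bindings(c, qc) for c in candidates(e, qc))
    return total


doc = Elem("doc", children=[
    Elem("author", children=[
        Elem("paper", children=[Elem("year", 2002), Elem("title", "Tree Synopses")]),
        Elem("paper", children=[Elem("year", 1999), Elem("title", "Tree Joins")]),
    ]),
    Elem("author", children=[
        Elem("paper", children=[Elem("year", 2005), Elem("title", "Graph Mining")]),
        Elem("paper", children=[Elem("year", 2003), Elem("title", "XML Tree Patterns")]),
    ]),
])

# q0 --.//paper[year>2000]--> q1 --./title[contains(Tree)]--> q2
q2 = QNode("title", pred=lambda v: "Tree" in v)
q1 = QNode("paper", axis="descendant",
           filters=[QNode("year", pred=lambda v: v > 2000)], children=[q2])
q0 = QNode("doc", children=[q1])   # q0 is always bound to the document root

selectivity = count_bindings(doc, q0)   # s(Q) for this toy document
```

    Here two papers published after 2000 have a title containing "Tree", so the sketch yields two binding tuples. A synopsis-based estimator would approximate this count without touching the document itself.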

    The class of supported value predicates in our twig queries depends on the value types of the queried XML elements, and is defined as follows.

    NUMERIC: range predicates of the general form [l, h], that specify a certain range [l, h] for the NUMERIC values of the designated XML elements; for example, find all the book elements with prices between $60 and $80.

    STRING: substring predicates of the general form contains(q_s), where q_s denotes a query string. A substring predicate is satisfied by XML elements with STRING values that contain q_s as a substring (i.e., similar to the SQL like predicate); for instance, return all books such that the publisher name contains the (sub)string "ACM".

    TEXT: keyword predicates of the form ftcontains(t_1, ..., t_k) (where t_1, ..., t_k denote terms from the underlying term dictionary) specifying exact term matches; for example, find all paper elements with abstracts containing the terms "XML" and "synopsis". (Our techniques can also handle other Boolean-model predicates, such as set-theoretic notions of document-similarity [4, 5].)
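    The three predicate classes can be sketched as small Python predicate factories. The function names are hypothetical, and two points are assumptions of this sketch rather than statements from the text: the range [l, h] is treated as inclusive at both ends, and ftcontains is treated as a conjunction over all listed terms, with a TEXT value represented as the set of terms it contains.

```python
from typing import Callable, FrozenSet


def numeric_range(low: int, high: int) -> Callable[[int], bool]:
    """NUMERIC predicate [low, high], assumed inclusive on both ends."""
    return lambda v: low <= v <= high


def contains(qs: str) -> Callable[[str], bool]:
    """STRING predicate contains(qs): qs occurs somewhere in the value."""
    return lambda v: qs in v


def ftcontains(*terms: str) -> Callable[[FrozenSet[str]], bool]:
    """TEXT predicate ftcontains(t1, ..., tk): every term appears in the value."""
    wanted = frozenset(terms)
    return lambda v: wanted <= v


price_ok = numeric_range(60, 80)
publisher_ok = contains("ACM")
abstract_ok = ftcontains("XML", "synopsis")

# Hypothetical element values.
abstract = frozenset({"an", "XML", "synopsis", "model"})
```

    With these definitions, `price_ok(75)`, `publisher_ok("ACM Press")`, and `abstract_ok(abstract)` all evaluate to true, while `ftcontains("XML", "histogram")(abstract)` does not, since one term is missing.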

    3. XCLUSTER Synopsis Model

    A Generic Structural Graph-Synopsis Model. Abstractly, our structural graph-synopsis model for an XML document tree T(V, E) is defined by a partitioning of the element nodes in V (i.e., an equivalence relation R ⊆ V × V) that respects element labels; in other words, if (e_i, e_j) ∈ R then label(e_i) = label(e_j). The graph synopsis defined for T by such an equivalence relation R, denoted by S_R(T), can be represented as a directed graph, where: (1) each node v in S_R(T) corresponds to an equivalence class of R, i.e., a subset of (identically-labeled) data elements in T (termed the extent of v and denoted by extent(v)); and, (2) an edge (u, v) exists in S_R(T) if and only if some element node in extent(u) has a child element in extent(v). (We use label(v) to denote the common label of all data elements in extent(v).)
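    The definition of the graph synopsis translates directly into a short construction, sketched below in Python. The names `Elem`, `graph_synopsis`, and `class_of` are hypothetical; `class_of` maps each element to an identifier of its equivalence class, and the example uses the coarsest label-respecting partition (one class per label), which is only one of many partitions the definition allows.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List


@dataclass
class Elem:
    label: str
    children: List["Elem"] = field(default_factory=list)


def graph_synopsis(root, class_of):
    """Build the graph synopsis for the partition induced by class_of.

    Returns (extent, edges): extent maps each synopsis node to its list of
    data elements; edges holds (u, v) whenever some element in extent(u)
    has a child in extent(v).
    """
    extent = defaultdict(list)
    edges = set()
    stack = [root]
    while stack:
        e = stack.pop()
        u = class_of(e)
        extent[u].append(e)
        for c in e.children:
            edges.add((u, class_of(c)))
            stack.append(c)
    # The partition must respect element labels.
    for node, members in extent.items():
        if len({m.label for m in members}) != 1:
            raise ValueError(f"class {node!r} mixes different labels")
    return dict(extent), edges


doc = Elem("doc", [
    Elem("author", [Elem("name"),
                    Elem("paper", [Elem("title")]),
                    Elem("paper", [Elem("title")])]),
    Elem("author", [Elem("name"),
                    Elem("book", [Elem("title")])]),
])

# Coarsest label-respecting partition: one synopsis node per label.
extent, edges = graph_synopsis(doc, class_of=lambda e: e.label)
```

    For this toy document the synopsis has six nodes, and the `title` node has an extent of three elements reached through both the `paper` and `book` nodes. Finer partitions, for example ones that separate titles of papers from titles of books, are obtained by passing a different `class_of`.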

    At a high level, several recently-proposed techniques for building statistical summaries for XML databases (including XSKETCHes [17, 19] and TREESKETCHes [18]) are all roughly based on the abstract node-partitioning idea described above. Unfortunately, none of these earlier research efforts considered the impact of element values of possibly different types (most importantly, strings and unstructured text), and their corresponding querying models on the underl...
