MiniCount: Efficient Rewriting of COUNT-Queries Using Views. IEEE 22nd International Conference on Data Engineering (ICDE'06), Atlanta, GA, USA, 3-7 April 2006.

  • MiniCount: Efficient Rewriting of COUNT-Queries Using Views

    Václav Lín¹, Vasilis Vassalos², Prodromos Malakasiotis²

    ¹ Dept. of Information and Knowledge Eng., University of Economics, Prague, Czech Republic

    ² Dept. of Informatics, Athens University of Economics and Business

    Patission 76, Athens 10434, Greece

    Abstract.

    We present MiniCount, the first efficient, sound, and complete algorithm for finding maximally contained rewritings of conjunctive queries with count, using conjunctive views with count and conjunctive views without aggregation. An efficient and scalable solution to this problem yields significant benefits for data warehousing and decision support systems, as well as for powerful data integration systems. We first present a naive rewriting algorithm implicit in the recent theoretical results by Cohen et al. [5] and identify three independent sources of exponential complexity in the naive algorithm, including an expensive containment check. Then we present and discuss MiniCount and prove it sound and complete. We also present an experimental study that shows MiniCount to be orders of magnitude faster than the naive algorithm, and to be able to scale to large numbers of views.

    1 Introduction

    The problem of rewriting queries using views [12, 18] has recently received significant attention, because of its relevance to a wide variety of data management problems: query optimization [3, 21], maintenance of physical data independence [19], data integration [13, 20] and data warehousing and decision support [22].

    In query optimization, trying to compute a query using previously materialized views can speed up query processing, since part of the computation needed for the query may have been done during the computation of the views. In the database design problem, view definitions provide a way to support the independence of the physical view of the data and its logical view. This independence allows us to modify the storage schema of the data without having to modify its logical schema, and to model more complex types of indices.

    Data integration systems provide a homogeneous query interface to a multitude of autonomous data sources. The users of data integration systems do not pose queries in terms of the schema in which the data is stored, but instead in terms of a mediated schema. This mediated schema is a set of relations that is designed especially for a specific data integration application. The system includes a set of source descriptions that provide semantic mappings between the relations in the source schemas and the relations in the mediated schema. In several data integration systems the contents of the sources are described as views over the mediated schema. Therefore, the problem of answering a user query, posed over the mediated schema, is equivalent to the problem of rewriting it into a query that refers directly to the source schemas.

    Even though much of the early work on the problem of rewriting queries using views focused on queries and views without grouping and aggregation, increasing the expressive power of the queries and the views to include aggregates is crucial for many important applications. It is especially important for data warehousing and decision support applications, where grouping and aggregation are very common operations [18], as well as for data integration applications, where the independent sources may be statistical, may only be exporting summary information of their contents [2], or may have grouping and aggregation capabilities that can be exploited. For this reason, the aggregate query rewriting problem has received a lot of attention recently [15, 7, 4, 10, 9, 5, 1].

    This paper proposes MiniCount, the first efficient algorithm for finding maximally contained rewritings of queries with count using views with count and conjunctive views without aggregation (i.e., unaggregated views). MiniCount is inspired by and substantially extends ideas in the MiniCon algorithm [16] to generate only contained rewritings for aggregate queries. The queries and views are conjunctive and are evaluated under bag-set semantics (i.e., evaluated under bag semantics on set-valued database relations). The algorithm is shown to be sound and complete, and it is shown to be significantly better than the naive algorithm implicit in recent theoretical results about aggregate query containment and rewriting [5]. In particular, we show how the naive solution to the rewriting problem has three independent sources of high complexity, including an expensive containment check, while MiniCount eliminates two of these three sources of complexity, including the containment check, and often cuts down substantially on the third. The experimental evaluation shows that MiniCount is orders of magnitude faster than the naive algorithm and scales well to large numbers of views.

    The algorithm applies to a significant subset of SQL that is important for data analysis and reporting, namely (non-nested) SQL queries with GROUP BY and COUNT (but without the HAVING clause), and can be easily extended to include SUM.

    The paper is organized as follows. Section 2 discusses related work in the area of query containment and query rewriting for aggregate queries and views. Section 3 introduces terminology and examples, and reviews and extends the relevant existing theoretical results on aggregate query containment in [5]. In Section 3.5 we present a naive rewriting algorithm for maximally contained rewritings that is implicit in recent work by Cohen et al. [5] and identify its sources of complexity. In Section 4 we present MiniCount and prove it sound and complete for count queries and count and unaggregated views. Finally, Section 5 presents experimental results. For a more detailed exposition of the work, including proofs for all theorems and lemmas, see [14].

    Proceedings of the 22nd International Conference on Data Engineering (ICDE06) 8-7695-2570-9/06 $20.00 2006 IEEE

    2 Related work

    There has been a large body of work on rewriting queries using views, and in particular on finding maximally contained rewritings, in a variety of settings that do not include the presence of grouping and aggregation operations in the queries and the views, e.g., see [8, 11]. In particular, [17] presents an efficient algorithm for rewriting queries using views.

    The problem of equivalence of aggregate queries has been discussed in [15, 6, 9, 4], while the problem of finding equivalent rewritings of aggregate queries using aggregate views has been discussed in [18, 6, 7, 10]. It is correctly observed in [5] that these papers do not use unions of aggregate queries to rewrite aggregate queries, and as such they miss rewriting opportunities.

    Cohen et al. [6] provide rewriting templates for aggregate queries, based on the kind of aggregate functions used in the queries and the views. Afrati and Chirkova [1] extend the results of [6] by providing more general rewriting templates for equivalent conjunctive rewritings of various classes of conjunctive queries.

    Cohen et al. [5] define and provide a solution to the problem of query containment for aggregate queries with expandable aggregate functions³ (such as max, sum, count), and they also investigate the problem of finding maximally contained sets of rewritings for conjunctive queries with count. In particular, they provide a syntactic characterization of possible rewritings that includes the definition of allowable rewriting templates, as well as a bound on the size of rewritings and the exclusion of new constants from rewritings (as in [12] for conjunctive queries). This characterization allows them to show the decidability of the problem. It also implies a naive method for computing maximally contained sets of rewritings.

    ³ Consider two multisets, $M$ and $M'$, such that they have the same elements and each element is contained in $M'$ with $n$-times greater multiplicity than in $M$. If aggregate function $f$ is expandable, then $f(M') = n f(M)$.

    Our contribution is a provably efficient and scalable algorithm for query rewriting using views in the presence of aggregation, and its experimental evaluation. We work within the theoretical framework of [5], which we extend to allow for more general rewriting templates.

    3 Conjunctive count query rewriting

    3.1 The Language of conjunctive count queries

    Conjunctive count queries (CCQ in the sequel) have the form

    $$q(\bar{s}, \mathit{count}) \leftarrow \varphi(\bar{z}, \bar{x})$$

    where $\varphi(\bar{z}, \bar{x})$ is a conjunction of positive (i.e., non-negated) atomic formulas (atoms in the sequel) defined over the EDB predicates (i.e., names of stored database relations; EDB stands for Extensional DataBase). Here $\bar{s}$ is a permutation of the variables in $\bar{x}$; $\bar{x}$ are the free variables; and $\bar{z}$ are the bound variables, bound by existential quantifiers.

    The free variables are also the grouping variables. A bound atom is an atom with at least one argument that is a bound variable; otherwise, the atom is free. A CCQ can be written in the form

    $$q(\bar{s}, \mathit{count}) \leftarrow f(\bar{x}), b(\bar{z}, \bar{x}),$$

    where $b$ is a conjunction of bound atoms and $f$ is a conjunction of free atoms. The core of $q$ is the query

    $$q_c(\bar{s}) \leftarrow f(\bar{x}), b(\bar{z}, \bar{x}).$$

    Example 1. Consider the query

    $$q(x, y, \mathit{count}) \leftarrow a(y), a(x), b(x, y, z).$$

    Here $\bar{s} = [x, y]$, $\bar{x} = [y, x]$, $\bar{z} = [z]$, $f(\bar{x}) = a(y), a(x)$ and $b(\bar{z}, \bar{x}) = b(x, y, z)$.

    We assume the extensions of EDB predicates are sets.
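The split of a CCQ body into free and bound atoms can be illustrated with a minimal Python sketch. The `Atom` class and the `split_atoms` helper are hypothetical names introduced here for illustration only (they are not from the paper), and the sketch assumes every atom argument is a variable, with no constants.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    """Hypothetical representation of an atom: predicate name plus variable names."""
    predicate: str
    args: tuple


def split_atoms(body, free_vars):
    """Partition body atoms into (free, bound).

    An atom is bound if at least one argument is a bound variable,
    i.e. a variable that is not among the free (grouping) variables.
    Assumption: all arguments are variables.
    """
    free_set = set(free_vars)
    free, bound = [], []
    for atom in body:
        if all(arg in free_set for arg in atom.args):
            free.append(atom)
        else:
            bound.append(atom)
    return free, bound


# The query of Example 1: q(x, y, count) <- a(y), a(x), b(x, y, z)
head_vars = ("x", "y")
body = [Atom("a", ("y",)), Atom("a", ("x",)), Atom("b", ("x", "y", "z"))]

free, bound = split_atoms(body, head_vars)
bound_vars = sorted({v for atom in body for v in atom.args} - set(head_vars))

print("free atoms: ", free)
print("bound atoms:", bound)
print("bound vars: ", bound_vars)
```

Under these assumptions, the two `a` atoms come out as free, `b(x, y, z)` as the only bound atom, and `z` as the only bound variable, matching the decomposition given in Example 1.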

    The result of evaluating $q$ over database $D$ is denoted $q^D$. Formally, queries are evaluated in two phases: first, the query core is evaluated (under bag semantics) and produces a multiset (bag) of tuples from $D$. Then, the identical tuples are grouped into equivalence classes, and to each equivalence class the aggregation function is applied. An assignment is a mapping of variables from $\bar{x}$ and $\bar{z}$ to the set of constants of $D$. Satisfaction of an assignment is defined in the obvious way. For a CCQ, $q^D$
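The two-phase evaluation described above can be sketched in Python as follows. This is a brute-force illustration, not the paper's algorithm: the function name `evaluate_count_query`, the encoding of atoms as `(predicate, args)` pairs, and the sample database are all assumptions made for the example. It enumerates every assignment of variables to constants of the database, so it is only suitable for tiny inputs.

```python
from collections import Counter
from itertools import product


def evaluate_count_query(head_vars, body, db):
    """Evaluate a conjunctive count query under bag-set semantics.

    head_vars: tuple of grouping (free) variable names
    body:      list of (predicate, args) pairs; args are variable names
    db:        dict mapping predicate name -> set of tuples (set-valued relations)
    """
    variables = sorted({v for _, args in body for v in args})
    constants = list({c for rel in db.values() for row in rel for c in row})

    # Phase 1: evaluate the core under bag semantics. Each satisfying
    # assignment contributes one tuple (its projection on the head variables).
    core_bag = []
    for values in product(constants, repeat=len(variables)):
        assignment = dict(zip(variables, values))
        satisfied = all(
            tuple(assignment[v] for v in args) in db[pred] for pred, args in body
        )
        if satisfied:
            core_bag.append(tuple(assignment[v] for v in head_vars))

    # Phase 2: group identical tuples and apply count to each group.
    groups = Counter(core_bag)
    return core_bag, {group + (n,) for group, n in groups.items()}


# Query of Example 1: q(x, y, count) <- a(y), a(x), b(x, y, z)
body = [("a", ("y",)), ("a", ("x",)), ("b", ("x", "y", "z"))]
# Hypothetical sample database with set-valued relations.
db = {
    "a": {(1,), (2,)},
    "b": {(1, 2, 10), (1, 2, 11), (2, 1, 10), (3, 1, 10)},
}

core_bag, result = evaluate_count_query(("x", "y"), body, db)
print(sorted(result))
```

On this sample database the group (1, 2) is produced by two satisfying assignments (z = 10 and z = 11), so its count is 2, while (3, 1) is excluded because 3 does not occur in relation `a`.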
