query evaluation in croque

17

Upload: uni-konstanz

Post on 15-Nov-2023

0 views

Category:

Documents


0 download

TRANSCRIPT

Query Evaluation in CROQUE�

� Calculus and Algebra Coincide �

Torsten Grust�� Joachim Kr�oger�� Dieter Gluche�� Andreas Heuer�� andMarc H� Scholl�

� University of KonstanzDatabase Research Group

P�O� Box D���� D����� Konstanz� Germanye�mail hfirstnamei�hlastnamei�uni�konstanz�de

� University of RostockDatabase Research GroupD����� Rostock� Germany

e�mail fjo�heuerg�informatik�uni�rostock�de

Abstract� With the substantial change of declarative query languagesfrom plain SQL to the so�called �object SQLs � in particular OQL� therehas surprisingly been not much change in the way problems of query re�presentation and optimization for such languages are tackled� We identifysome of the di�culties pure algebraic approaches experience when fac�ing object models and the operations de�ned for them� Calculus�styleformalisms suite this challenge better� but are said not to be e�cientlyimplementable in the database context�

This paper proposes a hybrid query representation and optimization ap�proach� combining the strengths of a many�sorted query algebra andthe monoid comprehension calculus� We show that e�cient executionplans beyond nested�loop processing can be derived � not only for ����� queries � in such a framework� The translation process accountsfor queries manipulating bulk�typed values by employing various joinmethods of the database engine� as well as queries that use aggregation�construction of arbitrary values� and arithmetics�

� Introduction

The choice of the intermediate representation for a declarative query languagesuch as OQL is guided by the vital characteristics that we demand from such arepresentation� namely

�a� the ability to capture the concepts �operations and types� of the query lan�guage completely� and

� CROQUE �Cost� and Rule�based Optimization of object�oriented Queries� is aresearch project funded by the German Research Association �DFG� under contractsHe ������� and Scho ������

�b� the suitability to serve as a starting point for a mapping from this repre�sentation to the primitives of the underlying persistent storage manager �fore�cient execution��

Traditional call interfaces to storage managers and query engines have beendesigned to be driven by algebraic query plans� Given a tree or DAG repre�senting the �optimized� query expressed in terms of the systems object algebra�an interpreter or intermediate compiler generates calls to the storage managerto perform the algebraic manipulations� In fact� some of these calls implementcertain algebra operators within the storage subsystem itself� Among them areselections� �p�� some projections� �� several join types� like �p �join�� np �semi�join�� np �antijoin� the complement to n�� and � p �two�sided outerjoin� as longas predicate p ful�lls simplicity restrictions� All other operations are carried outduring a postprocessing phase that mainly takes place in memory�

Algebraic approaches to query representation �t into this scenario� The initialalgebraic expression is rewritten so that large subtrees of the query tree can beprocessed under the responsibility of the storage subsystem� Apparently� algebrasful�ll requirement �b� well� However� with the substantial change of declarativequery languages from plain SQL to so�called object SQLs�� in particular OQL�there has surprisingly been almost no change in the way the problems of queryrepresentation and optimization are tackled� In some sense� query processors didnot re�ect the shift in paradigm the query languages themselves went through�

Algebras that have been derived from the well�known nested relational alge�bras �NRAs� still dominate the research work in this evolving �eld by far� Most ofthese approaches did not add expressiveness to NRA� though� which made themcover the classic SQL subset of OQL only� Notable exceptions are the e�orts of��� and ���� and their query algebras AQUA and PFL� respectively�

The AQUA data model and algebra has been designed to serve as a stan�dard� intermediate representation for object query languages� AQUAs operatorsmay operate on scalars as well as values of various bulk types� AQUA gains thisgenerality by de�ning operators as higher�order functions that themselve takefunctions as parameters� Among other uses� these function parameters repre�sent value constructors �e�g� set union or list concatenation�� thus turning aset�manipulating operator into a list manipulator by appropriate parameteriza�tion�

Poulovassilis and Smalls functional query language PFL realizes the generalapplication of functions and iteration abstractions �like fold� which are neededto implement OQL� The semantically clean foundation of PFL allowed for thederivation of a large body of equivalences� the basis for the development of aPFL query optimizer�

The issue of e�ciently mapping such languages to database query engineprograms� however� are not dicussed� It is one of the core work items of ourCROQUE project to bridge this gap for a language similar to PFL� the monoid

calculus ��� The calculus is expressive enough to represent OQL completely�Translating OQL into the monoid calculus results in uniform program schemes

�monoid comprehensions� which help to keep the expected size of an optimizerrewrite rule base as well as its search space in tractable bounds�

In ��� Daniel Chan proposed collection comprehensions �over sets� bags� andlists� as an intermediate representation for object SQLs� We map quanti�cationand aggregation to comprehensions too� making the intermediate query formseven more uniform� Further processing phases �mappings A � and A � to be in�troduced in Sect� �� bene�t from this simplicity� Chans execution engine isnon�standard in the sense that it does not support joins� but map and reduce

�fold �� Instead� our query compilation process assumes a richer algebraic queryengine as sketched above �with extensions� as its translation target�

We proceed as follows� In Section �� some light is shed on the consequencesof understanding OQL as a better SQL� only� Section � introduces CROQUEsinternal hybrid representation of OQL which� we believe� overcomes the majorde�ciencies of the pure algebraic query processing strategies� The query compi�lation process assumes an algebraic query engine as sketched above �with exten�sions� as its translation target� Section � summarizes�

� Algebraic Representations for OQL

Having the instruments of NRA at hand� there is a tendency to view OQL queriesas advanced� SQL queries� with an explicit emphasis on the e�cient translationof nested subqueries ���������

��� Nested selects

The following select�from�where clause is understood as the basic buildingblock of complex queries �let e denote e�� � � � � en��

select struct�a��ek� � � � � � ak�ekk�

from E� as e�� � � � � En as enwhere p�e�

The obvious algebraic translation would be

�ek� �����ekk ���e�p�e���E� � � � ��En��

�query optimizers then convert the Cartesian products into real joins if possible��

Note that the OQL select�from�where block delivers a bag �multiset� asits result type� which already demands a non�standard interpretation of theformer set�based algebraic operators� especially � and� in this case� Now� nestedsubqueries may occur in the where�clause �which has been allowed since the earlydays of SQL and has been extensively studied� e�g� in ��� and a whole series offollow�up papers� and the from�clause� Nested subqueries in the select�clause

have been investigated more recently� Consider

select struct�a��e�� a��select f�e�� e��from E� as e�where p��e�� e���

from E� as e�where p�e��

The above form immediately leads to a nested projection if we apply the abovealgebraic translation scheme� Execution of such plans often results in ine�cientnested loop processing� Consequently� recent work proposed non�standard joinoperators such as binary grouping �� � ���� hierarchical joins ���� or the nestjoin

��� ����� which we will look upon more closely in the following� Actually� theabove query is implemented by a single nestjoin which performs a join overpredicate p� and a grouping with regard to function f simultaneously�

���e��p�e����E��� ��e���e��f�e��e����e���e��p��e��e���a�

E�

The nestjoin allows for unnesting the subquery� operand E� appears at the top�level� giving us several alternatives of implementing this query e�ciently�

In R�f�p�MS� for each r � R� predicate p tests for each s � S if s is memberof the group �named M in the result� belonging to r� and� if so� adds the resultof f�r� s� to the group� As a simple example from the relational domain� consider

R

� A

a �b �c �

S

� B

a ��a ��c ��

R ��r��s��r�A�s�B���r��s��r��s����M

S

� A M

a � f��� ��gb � fgc � f��g

In contrast to the equivalent join�group�map sequence there is no need tointroduce null values� empty groups are represented as empty collections whichare part of the data model�

��� Non�linear Translations

We already end up with signi�cantly more complex translations if we take thefreedom to construct arbitrary values in the select�clause from the variableswe declared in the from�clause� like OQL allows us to�

select struct�x�e�x� y�e�y�M�set�e�x� e�y��

from E as e

First� note that this query deals with bags and sets simultaneously� which isquite typical for OQL queries but in no way common for proposed algebraicapproaches� Let us ignore this fact for the sake of this example�

We cannot translate this query by a standard projection �note that we usethe same values� e�g� e�x and e�y� more than once to build the complex result� or

nesting �set M includes values of di�erent columns of E�� Rather� a translationwould be

E �id��e���e���e��xe��e��ye���M

��x�E� � �y�E��

�where id � �x�x denotes the identity mapping�� This translation is non�linear

in the sense that expression E has to be copied and further processed every timewe reference a variable that ranges over it �here e� to construct our complexvalue� This is an inherent problem of all algebraic approaches� being variable�less formalisms by nature� The translation is likely to be ine�cient� A formalismthat incorporates variables� like an appropriate calculus� would allow for a directtranslation� Consider

bagfhx � e�x� y � e�y�M � setfe�x� e�yg�i j e� Eg

which denotes a bag comprehension� an expression of the monoid calculus weintroduce shortly� Variable e gets sequentially bound to the elements of E andcan be used in the head of the comprehension to construct values which are thencollected to make up the resulting bag� More importantly� the representation islinear�

If we abstract from the problem of value construction to the general ap�plication of functions under the variable bindings generated by the from� andwhere�clauses� we obtain a more general view of the basic select�block in OQL�

select f�e�from E� as e�� � � � � En as en

where p�e�

Algebras have been enriched by a very powerful operator� namely mapf �� toaccount for this generalization� mapf ��E� applies function f to every elementof E� making

mapf ����e�p�e���E� � � � ��En��

a translation of the above select� Note that f may be quite complex� in par�ticular an expression of the enhanced NRA itself�

��� Queries with Aggregates

If we consider more than just select�from�where �i�e� ����� respective map����� queries� we again have to enrich algebras� As an example� consider queryQ��

Q� � sum�select x

from flatten�E� as x

where x � ��

The sum aggregate � or operator � in general � has no algebraic counterpartyet� We could resort to treating such operators as black boxes� that are just

moved around during the query rewriting process� An optimizer that knowsabout �s properties �like being commutative� associative� and having identity��� however� has the choice of folding the aggregation into the program evaluatingthe select�from�whereblock� thus reducing the need for extra summation loops�This applies to aggregations in general� covering OQLs count� max�min� avg�exists� and forall �where the latter two aggregate boolean values��

The incorporation of general aggregations has been realized by means of foldoperators of various sorts ����� � ����� fold�� f ��E� applies function f to everyelement of E and accumulates the function results via the binary operator ��sum�E� is thus implemented by fold!� id��E�� The uses of fold are manifold� Wewill use folding of boolean values to implement quanti�cation as well� Considerfold���e�e � ����E� which implements the predicate �e � E � e � ���

The use of aggregation� operations on scalars like arithmetics� and multiplecollection types in one query are not captured by the algebras proposed in thevast majority of work on query algebras for object SQLs� Being strict� we wouldhave to say that these algebras miss requirement �a� and therefore fail to becomplete representations of OQL�like languages at all� To summarize� let us listthe features we identi�ed as crucial for an adequate OQL representation thatful�lls both requirements�

��� We clearly need specialized join operators �such as �� which can copewith certain nesting cases�

��� We need a means to express general function application �map�� the con�struction of arbitrary values in particular�

��� There have to be ways to reason about aggregation and quanti�cation�which we can look at as aggregation over boolean values�� i�e� fold�

��� The treatment of the diverse collection types �sets� bags� lists� shouldbe uniform to avoid special operators like list�set�join or bag�projection� In�corporation of new collection types should be possible seamlessly�

��� The Monoid Comprehension Calculus

CROQUEs approach incorporates the monoid comprehension calculus into itsquery representation to achieve these fundamental goals� The monoid calcu�lus actually does ful�ll the features �������� thus meeting requirement �a� fromSect� �� However� calculi are said to be not e�ciently implementable in thedatabase context� since the calculus expressions inherently describe nested loopprocessing strategies� Section � will present a hybrid approach� comprising an al�gebra and the monoid calculus� which shows that query execution plans beyondnested loop processing are in fact derivable in this framework� This will �nallyprovide us with a framework that meets both characteristics we identi�ed in theIntroduction�

We will introduce the ideas of the monoid comprehension calculus onlybrie�y here� ���������� give comprehensive surveys� The calculus can be lookedat as a generalization of the well�known relational calculus� It is more generalin the sense that we gain the ability to manipulate not only sets� but any type

that possesses the properties of a monoid� A monoid M is an abstract alge�braic structure equipped with an associative binary operation mergeM�� havingthe distinguished element zeroM� as its identity� Monoids that represent col�lections additionally specify the function unitM� that constructs a singletonvalue of that collection type� Within the monoid framework� we now repre�sent the bulk types set� bag� and list as triples �zeroM�� unitM��mergeM���resulting in �fg� feg���� �ffgg� ffegg���� and ��� e��!!� respectively� The proper�ties of mergeM� �being idempotent� commutative� or both�� in fact� make thethree collection types di�erent� Our primary focus is the manipulation of bulkdata� but the diverse types of aggregation and arithmetic operations we en�counter during the translation of OQL �t into this framework as well� Examplesof such non�collection monoids are sum � ��� id�!�� some � �false� id���� andall � �true� id��� While sum serves the obvious purpose� some and all imple�ment existential and universal quanti�cation� Similar monoid instances are max�min� and prod�

Queries are represented as a subclass of special monoid mappings� mono�

id homomorphisms� which we write down as monoid comprehensions� thecorresponding generalization of set comprehensions or ZF�expressions� The ho�momorphisms provide us with a semantically clean way to reason about queriesthat involve multiple collection and scalar types since the range and destinationmonoids of these mappings may not be the same ������

In the monoid comprehension Mff j q�� � � � � qng the quali�ers qi areeither generators of the form e � E or predicates p� A generator qi sequen�tially binds variable e to the elements of its range E� e is bound in qi��� � � � � qnand f � The binding of e is propagated until a predicate evaluates to false underthis binding� Function f � the head� is evaluated under all bindings that pass�and the function results are then accumulated via mergeM� to make up the�nal result� Comprehensions nest to arbitrary depth� Abstracting monoid ele�ment operations this way gives us the ability to reason about equivalences andrewriting rules independent of particular collection types� E�g�

Mff j q�� e� zeroN �� q�g zeroM�

�where the qi denote sequences of quali�ers� holds for any instantiation of mo�noidsM and N � making the expected size of an optimizer rewrite rule base moremanageable�

Examples of comprehensions are the following mapping from lists to bags

bagfe j e� �� �� �� ��� e �� �g � ff�� �� �gg

or

setfhe�� e�i j e� � E�� e� � E�� somefe� � c j c� e��Mgg

which computes the join of E� and E� with regard to predicate e� � e��M �

Note that for E of type N and � denoting mergeM� of some monoidM� thecomprehension Mfe� j e� E� e� � f�e�g is equivalent to fold�� f ��E�� i�e� the

operation we identi�ed as crucial for our query evaluation purposes�

fold�� f ��zeroN �� � zeroM�

fold�� f ��unitN ��e�� � f�e�

fold�� f ��mergeN ��e�� e��� � f�e��� fold�� f ��e��

These three equations re�ect the way of how values are constructed by the calcu�lus� namely by the application of zeroM�� unitM�� and mergeM� exclusively�

In fact� one can show that folds expressiveness is su�cient to capture OQL�In ��� we actually gave a complete mapping of OQL to the monoid comprehen�sion calculus� Summing things up� what we gain with the monoid calculus is ���the necessary expressive power to capture OQL�like languages� ��� a uniform andextensible way of manipulating values of diverse collection types� ��� a uniformtreatment of arithmetics� quanti�cation� and aggregation� and ��� a formalismthat can handle variables and bindings like the source query language can�

Mapping the source language� OQL� to the monoid calculus as the targetformalism is actually a straightforward process� The OQL from�clause translatesinto a sequence of generators� predicates in the where�clause introduce �lters�while the select�clause corresponds to a comprehensions head� i�e� functionapplication� Let T be the mapping from OQL to the monoid calculus� The coreof T would then read

T

��select f�e�from E� as e�� � � � � En as enwhere p�e�

�A � bagfT�f�e�� j e� � T�E���

� � � �

en � T�En��T�p�e��g

This meets the OQL semantics� including the case of one or more Ei beingempty collections� the overall result will be empty� ��� extends T to capturethe complete speci�cation of OQL ��� ��� The problems we experienced withthe construction of complex values� especially non�linear translations caused bythe use of variables� do not a�ect the calculus� as we have mentioned before�Consider

Q� � select distinct e

from E as e

where exists x in M� N � set�e� x�

�where � denotes the set inclusion predicate� and its image under T

T�Q�� � setfe j e� E� somefN � setfe� xg j x�Mgg

However� how are we supposed to �nd an e�cient algebraic execution plan fora nested comprehension like this"

� A Hybrid Translation and Optimization Schema for

OQL

We just observed that monoid comprehensions basically correspond to nestedfolds when translated into an extended nested algebra� While this is true� this

observation may lead to the impression that one is stuck with nested loops whenit comes to �nding execution plans for comprehensions� fold�� f � accesses eachelement of the argument collection �possibly even in order� if applied to a list��applies function f and then merges the result via �� There is no obvious strategyfor the e�cient evaluation of monoid calculus expressions like the strategy forNRA we sketched in the Introduction� We miss a mapping of nested comprehen�sions to the primitives of query engines which are tailored to evaluate terms ofa particular algebra�

On the other hand there are classes of OQL queries that clearly are of alge�braic nature� e�g�� compute joins� perform selections� simple projections� group�ing� and nesting� and we are better o� to translate these queries into algebraicplans directly� The large body of knowledge of optimization methods in this �eldshould clearly be reused if possible�

The apparent potential of the monoid calculus and the latter observationmotivated the hybrid approach to OQL representation and optimization that wepursue in the CROQUE project�

� Translate the source OQL query into a mix of monoid calculus and algebra�which we will introduce below� If we are able to detect subqueries that canbe computed by algebraic terms at this point� we choose an algebraic repre�sentation for these subqueries� To ensure a clean semantics for such mixedrepresentations we will de�ne the algebra operators by means of monoid cal�culus terms� In addition� these de�nitions will provide a de�nite semanticsfor the operators when applied to the di�erent collection types� We obtain amany�sorted query algebra this way�

� Perform rewriting on this representation� The whole body of rule�based al�gebraic rewriting techniques applies� We additionally may apply calculusrewrite rules which above all allow for simple unnesting of complex sub�queries� Comprehension normalization �which basically is accomplished byapplying the rewriting rules of Fig� �� actually implements numerous unnest�ing techniques found in the literature� Calculus expressions may be trans�lated into algebra and vice versa� The rewriting process follows the heuristicof translating calculus into equivalent algebraic terms�

� As a �nal step� convert remaining calculus terms into folds� We will showshortly how this can be done� It is a requirement for the underlying queryexecution engine to provide facilities for the evaluation of fold�

In the remainder� we will present the strategy sketched above� Let us continueby introducing the query algebra to the depth needed�

��� A Monoid�Aware Algebra

The following introduces some operators of the algebra by putting the operatorsde�nitions down to the equivalent monoid comprehensions as mentioned above�We gain several advantages with this approach� ��� a semantically clean inter�action of the algebra with the calculus is guaranteed� Calculus terms may play

the role of selection or join predicates� may represent functions �e�g�� as they areneeded as parameters for map and fold�� or may be arguments to algebra oper�ators in general� ��� The generic de�nitions hold for any monoid instantiation�as long as well�de�nedness conditions are obeyed� The algebra will operate onmultiple collection types� not only sets� ��� We uncover the control structure ofthe operators� Unfolding and loop fusion becomes possible�

Due to space restrictions� we merely present the basic ideas here� but thedesign of the operators follows a common principle� #� contains a comprehensivelist of all operators�

In the following� read symbol �� as has type� Let int� bool� and string be atomictypes� ha� ����� � � � � an ���ni is a record with �eld tags ai and respective �eld types�i� concatenates records� With M being a collection monoid �set� bag� list��and � being some type� �M is the type denoting a M�collection of ��members�e�g� int list�

The core of the algebra is fold�� f ��E�� which applies f to the elements ofE� It then uses �� the merge operation of some monoidM� to collect the result�In particular� if we have E ���M� f ���� M�� ��M�M� M� we de�ne

fold���e�f�e���E� �Mfe� j e� E� e� � f�e�g

The results type is M�The semijoin E�npE� is a join variant that delivers only those left operand

objects that have at least one join partner with respect to the predicate p�Implementation can be e�cient in terms of space �bu�ers or �les that hold theintermediate result only carry E� objects� and time �as soon as any join partneris found for an E� object� it belongs to the result� no other E� object has to betouched�� Let E� �� �M� E� �� N � p �� �� � bool� then we de�ne

E� n�e���e��p�e��e��

E� �Mfe� j e� � E�� somefp�e�� e�� j e� � E�gg

The aforementioned nestjoin� E��f �p�AE� is a non�standard join between E�

and E� in the sense that for each E� object we create a group A of E� objectsthat match the join predicate p� On creation of this group we apply f � Typesare as follows� E� ���M� E� ��N � f ���� � � p ���� � bool� We then have

E� ��e���e��f�e��e���

�e���e��p�e��e���A

E� �Mfe� hA � Nff�e�� e�� j e� � E��

p�e�� e��gi j e� � E�g

The result is of type ��hA ��NiM� � is valuable when translating queries withnesting in the select clause� as well as for the evaluation of complex predicatesbetween select blocks� E�cient techniques for the evaluation of grouping� e�g�sorting or indices� apply to the implementation of �� too�

��� Execution Strategies for Monoid Comprehensions

Achieving good execution plans for the monoid calculus is not as obvious asfor algebras� Provided we believe the calculus to be a valuable approach to

object SQL query processing� we face the problem of how to �nd good QEPsfor potentially deeply nested comprehension expressions� In what follows we willpresent a mapping� A � from the calculus to a hybrid representation employingthe monoid�aware nested algebra� which we assume to be the target language ofthe query execution engine�

Successive applications of A will replace monoid calculus terms by algebraicequivalents� The intermediate results of A are subject to algebraic and calculus�based rewriting� to the results of which A is applied again� Since any calculusterm at least has its canonical fold representation we are guaranteed to have away out of this process� The rewriting �nally yields a term of the target algebrawe introduced� i�e� a program for our query engine� The degrees of freedom inthis process de�ne the dimensions of the search space along which we look for theoptimal representation of our query� calculus � calculus� calculus � algebra�and the classical algebra � algebra�

After introducing the ideas of A we will present the derivation of executionplans in this framework by example� including the problematic queries Q� andQ� from Sect� �� Let us start with a �rst basic version A � of the mapping andimprove A as we go�

A � � Comprehensions Implemented by fold� The initial mapping A � putsthe comprehensions down to their nature of being monoid homomorphisms�which� in turn� are implemented by the algebras fold operator �see Fig� ��� A � isde�ned via structural recursion over the list of quali�ers� Constants �Rule �� anddatabase variables ���� e�g� globally named collections� provide the base cases forthe recursion� Rules ������� reduce the quali�er list from its head on� Rule ���introduces a fold over a generators range� Predicates are handled via a condi�tonal ��� that replaces non�qualifying elements by Ms identity zeroM�� whichdoes not contribute to the result�

A � �c� � c ���

A � �v� � v ���

A � �Mff j g� � unit�M��A � �f�� ���

A � �Mff j e� E� qg� � fold�merge�M�� �e�A � �Mff j qg���A � �E�� ���

A � �Mff j p� qg� � if A � �p� then A � �Mff j qg� else zero�M� ��

Fig� �� Mapping A � � Implement comprehensions by nested folds�

Note that it would be su�cient for A � s target query engine to implementfold� the respective monoid operations� and a conditional to execute the resultingplans� While this at least provides a complete mapping from comprehensions tothe algebra� we always end up with nested applications of fold because of ����

i�e� nested�loop processing� Consider the result of A � s application to T�Q�� inwhich the inner fold implementing the existential quanti�cation is executed foreach e � E�

�A � � T��Q��

� A � �setfe j e� E� somefN � setfe� xg j x�Mgg�

����

fold����e�A � �setfe j somefN � setfe� xg j x�Mgg���A � �E��

����

fold����e�if A � �somefN � setfe� xg j x�Mg� then A � �setfe j g�else fg��E�

����

fold����e�if fold��� �x�A � �somefN � setfe� xg j g���M� then fA � �e�gelse fg��E�

����

fold����e�if fold��� �x�N � fe� xg��M� then feg else fg��E�

A � fails to employ the more e�cient algorithms of the query engine�However� since the query engines operators have been put down to monoid

calculus terms� we have the possibility to incorporate join strategies into themapping�

A � � Reduction by Pattern Matching� The extended version A � under�stands the algebra operator de�nitions as patterns to be detected in the com�prehension expressions� On a successful match� A � replaces the match by itsequivalent algebraic term� We derive an initial A � by adopting the rules of A �and adding the cases of Fig� � which improve the processing of predicates andquanti�ers� More than one rule may match at a time and any rule may thenbe selected arbitrarily� However� we listed the rules in the order an optimizerheuristic should try to match� e�g� we prefer to implement universal quanti�ca�tion by n � � before resorting to a � with a nested fold as its predicate �����Note that ���� covers ���� we can therefore drop ��� and the if�then�else condi�tional completely� Since the n is feasible only if no dependencies �besides the joinpredicate� between the join partners exist� we check this condition by examiningthe set of free variables z���� Some rules introduce new generator variables andwe adapt the comprehension head accordingly� mainly by renaming variables�thus turning f into f � in ��� and ����� Finally� the comprehension pattern of���� serves as a catch�all case and implements the application of f via map asdiscussed in Sect� ����

If we turn back to query Q� again� we now rewrite as follows�

A � �setfe j e� E� somefN � setfe� xg j x�Mgg�

���A � �setfe

� j e� � E n�e��x�A� �N�setfe�xg�

Mg�

����

fold��� �e��A � �setfe� j g���E n

�e��x�N�fe�xgM�

����

fold��� �e��fe�g��E n�e��x�N�fe�xg

M�

in which we can drop the outer fold if E �� � set �the fold merely implements atype coercion from the semijoins output type to � set��

A � �Mff j q� e� E�somefp j q�gg�

� A � �Mff j q� e� A � �E� nA� �p�

A � �q��g�

if e �� z�q�� ���

A � �Mff j q� e� E�allfp j q�gg�

� A � �Mff j q� e� A � �E� n�A� �p�

A � �q��g�

if e �� z�q�� ���

A � �Mff j q�e� � E��e� � E�g�

� A � �Mff � j q� e� A � �E�� A � �E��g�

if e� �� z�E�� ���

A � �Mff j q� p�� p�g� � A � �Mff j q� p� p�g� ���

A � �Mff j q� pg� � A � �Mff � j e� ��A � �p���A � �q��g� ����

A � �Mff j qg� � map�A � �f���A � �q�� ����

Fig� �� Mapping A � � Operator pattern matching�

Figure � only shows a subset of the rewriting rules� since each algebra op�erator introduces at least one rule according to its de�nition� See #� for thecomplete A � rewrite rule base� An actual optimizer based on our approach will�however� certainly feature a larger set of rules�

Up to now� we have improved the calculus �� algebra mapping� By addingthe rules of Fig� � we extend the translation scheme once more and introducethe calculus �� calculus �i�e� comprehension level� rewriting dimension� While���� and ���� allow for selection pushdown and join reordering� respectively����� unnests a comprehension �over monoid N �� The binding e f � makes e asynonym for f � in the comprehensions head� This last rule� together with ����and ����� serves to remove dependencies between potential join partners� thusincreasing the chance of introducing e�cient joins and abandoning nested loopprocessing�

By translating algebra operators� aggregations� quanti�ers� and predicatesinto their comprehension equivalents� we uncover their control structure� E�g��we may view N � M as some atomic set predicate for which we expect acorresponding query engine procedure� but as well may carry out the subsettest explicitly� allfsomefy � z j z �Mg j y � Ng� This opens possibilitiesfor join pattern detection or loop fusion� The latter is not feasible if we viewthe operators only as black box functions that ful�ll certain abstract algebraicproperties� The translation of Q� backs up this claim� Uncovering � makes anantijoin applicable and we are actually able to deduce a pure algebraic plan�

A � �Mff j q� e� E� pg� � A � �Mff j q� p� e� Eg�

if e �� z�p� ����

A � �Mff j q� e� � E�� e� � E�g� � A � �Mff j q� e� � E��e� � E�g�

if e� �� z�E�� ����

A � �Mff j q� e�Nff � j q�gg� � A � �Mff j q� q�� e � f �g�

if N �M ����

A � �Mff j q� somefp j q�g� q�g� � A � �Mff j q� q�� p� q�g�

if merge�M� is idempotent ���

A � �Mff j q� e� zero�N �� q�g� � zero�M� ����

A � �Mff j q� e� unit�N ��e��� q�g� � A � �Mff j q� e � e�� q�g� ����

A � �Mff j q� e� merge�N ��E�� E��� q�g� � merge�M��A � �Mff j q� e� E�� q�g��A � �Mff j q� e� E�� q�g��

if M is commutative ����

Fig� �� A � extended� Comprehension level rewriting�

A � �setfe j e� E� somefallfsomefy � z j z � fe� xgg j y � Ng j x�Mgg�

�����

A � �setfe j e� E� x�M� allfsomefy � z j z � fe� xgg j y � Ngg�

����

A � �setfe j e� E� x�M� allfy � e � y � x j y � Ngg�

���

A � �setfe j e� � E M� e � e�E� x � e�M � allfy � e � y � x j y � Ngg�

����

A � �setfe��E j e�� � �E M� n�e���y�y ��e�

E�y ��e�

M

Ng�

�����

�E��E M� n�e���y�y ��e�

E�y ��e�

M

N�

In the above� a subscripted variable eM denotes es value projected onto the�elds l�� � � � � ln if M �� hl� �� ��� � � � � ln �� �ni C for any collection constructor C�record projection��

To conclude the example rewritings� let us turn back to query Q� and trackthe rewriting steps that an optimizer equipped with the complete rule base couldundertake� Recall� that Q� �attened E� a bag of collections� and then added allelements being greater than zero to obtain the �nal result�

�A � � T�

��sum�select x

from flatten�E� as xwhere x � ��

�A

��T�

A � �sumfe j e� bagfA � �x� j x� A � �flatten�E�� A � �x � ��gg�

��flatten�

A � �sumfe j e� bagfx j x� bagfz j z� � E� z � z�g� x � �gg�

�����

A � �sumfe j e� ���x�x � ���bagfz j z� � E� z � z�g�g�

��A� �

A � �sumfe j e� ���x�x � ���fold� � id��E��g�

��A� �

fold��� id�����x�x � ���fold� � id��E���

In principle� the resulting program performs three loops to accomplish itstask� Loop fusion techniques� however� allow us to transform the query intoa program that only needs two loops� Instead of �attening� each collection inE is summed up� mapping elements less than or equal to zero to !s iden�tity �� giving a bag of intermediate sums� A second outer loop then sumsup� fold!��e�fold!��x�if x � � then x else ���e���E�� This superior alternativemay directly be derived by A � � i�e� actually is element of our optimizers solutionspace�

Since ! is a �rst�class operator in the monoid framework and its algebraicproperties are well known �esp� having identity ��� we can apply loop fusionhere� Lifting any type onto the monoid level makes the operations of these typestransparent and therefore subject to further analysis and optimization� This isa consequence of the basic requirement �a� we stated at the very beginning� Acomplete representation of the query languages types and operations in whichno black boxes� remain has to be aspired�

A basic rewriting strategy for an optimizer based on the A � rules would tryto unnest nested comprehensions �i�e� subqueries� using ����� ����� and �����The goal is to remove variable dependencies that prevent the introduction ofjoins� �#��� discuss a normal form for monoid comprehensions that realizes aminimal number of such dependencies� We move along the calculus �� calculusdimension this way� Then� moves in the calculus �� algebra dimension are carriedout in order to introduce joins and the other algebra operators� Rules ��������implement such moves� This is based on the assumption that the query engineknows about e�cient implementations for these operators� All these rewritingsteps may be interleaved by algebraic as well as comprehension level rewriting�To obtain a program for the query execution engine� all remaining calculus termsare converted into folds as a �nal step�

� Conclusions and Further Work

Facing todays object SQLs�� in particular OQL� this paper proposed to let themonoid play the role sets played in the relational setting� The monoid calculus

has been shown to provide the expressiveness that is needed to completely cap�ture OQL� In order to obtain e�cient execution plans for algebra�tailored queryengines beyond nested�loop processing� we introduced a monoid�aware fold al�gebra� A � � T implemented a rewrite�based mapping from the source query lan�guage OQL to this target algebra� employing non�standard joins� This hybridapproach� combining algebra and calculus� takes advantage of the strengths ofboth formalisms during the derivation of query engine programs�

Some issues are still open� Most importantly� a rewriting strategy based onA � and its successors has to be developed� Moves in the three rewriting di�mensions have to be coordinated� Should the algebra be extended to cope withfurther particular nesting cases� in the spirit of �" Which in�uence does foldhave on the construction of a suitable query engine" Additionally� incorporat�ing the calculus into the evaluation framework opens new possibilities to exploitfurther program transformation techniques� like loop fusion� thus widening theview of database query optimization� Advanced query optimization techniques�like bypass processing for disjunctive queries� have to be adapted to �t the hy�brid approach� ��� already took a step in this direction� Our CROQUE projectis underway implementing the setup we discussed in this paper� Although basedon a subset of rules of A � and a limited set of algebra operators only� �rst testsshowed promising translation results for a broad range of OQL queries�

References

�� Francois Bancilhon� Ted Briggs� Setrag Khosha�an� and Patrick Valduriez� FAD� aPowerful and Simple Database Language� In Peter M� Stocker� William Kent� andPeter Hammersley� editors� Proceedings of the ��th Int�l Conference on Very LargeDatabases �VLDB�� pages ������ Morgan Kaufmann Publishers� September �����

�� Catriel Beeri and Yoram Kornatzky� Algebraic Optimization of Object�OrientedLanguages� In Paris C� Kanellakis and Serge Abiteboul� editors� Proceedings ofthe �rd Int�l Conference on Database Theory �ICDT�� volume ��� of LNCS� pages������ Springer Verlag� December �����

�� Peter Buneman� Leonid Libkin� Dan Suciu� Val Tannen� and Limsoon Wong� Com�prehension Syntax� In SIGMOD Record� volume ��� pages ������ March �����

�� Rick G� Cattell� editor� The Object Database Standard� ODMG��� chapter ��Object Query Language �Release ����� pages ����� Morgan Kaufmann Publishers����� Updates to Release ����

� Daniel Chan� Objectoriented Language Design and Processing� PhD thesis� Com�puter Science Department� University of Glasgow� �����

�� Sophie Cluet and Guido Moerkotte� Nested Queries in Object Bases� In Proceedings of the �th Int�l Workshop on Database Programming Languages �DBPL��September �����

�� Leonidas Fegaras� E�cient Optimization of Iterative Queries� In Proceedings of the�th Int�l Workshop on Database Programming Languages �DBPL�� pages �������September �����

�� Leonidas Fegaras and David Maier� Towards an E�ective Calculus for ObjectQuery Languages� In Proceedings of the �� ACM SIGMOD Int�l Conference onManagement of Data� May ����

�� Torsten Grust� Joachim Kr�oger� Dieter Gluche� Andreas Heuer� and Marc H� Scholl�Query Evaluation in CROQUE Calculus and Algebra Coincide� Technical Reportin preparation� Department of Mathematics and Computer Science� Database Re�search Group� University of Konstanz� April �����

��� Torsten Grust and Marc H� Scholl� Translating OQL into Monoid Comprehensions� Stuck with Nested Loops Technical Report �a������ Department of Mathe�matics and Computer Science� Database Research Group� University of Konstanz�September �����

��� Won Kim� On Optimizing an SQL�like Nested Query� ACM Transactions onDatabase Systems� ������������ September �����

��� Theodore W� Leung� Gail Mitchell� Bharathi Subramanian� Stanley B� Zdonik�and other� The AQUA Data Model and Algebra� In Proceedings of the �th Int�lWorkshop on Database Programming Languages �DBPL�� September �����

��� Alexandra Poulovassilis and Carol Small� Investigation of Algebraic Query Opti�misation for Database Programming Languages� In Proceedings of the �th Int�lConference on Very Large Databases �VLDB�� pages ������� Santiago� Chile�September �����

��� Christian Rich� Arnon Rosenthal� and Marc H� Scholl� Reducing Duplicate Work inRelational Join�s� A Uni�ed Approach� In Proceedings of the Int�l Conference onInformation Systems and Management of Data �CISMOD�� pages ������� Delhi�India� October �����

�� Hennie J� Steenhagen� Peter M�G� Apers� Henk M� Blanken� and Rolf A� de By�From Nested�Loop to Join Queries in OODB� In Proceedings of the �th Int�lConference on Very Large Databases �VLDB�� pages �������� Santiago� Chile�September �����

��� Phil Trinder� Comprehensions� a Query Notation for DBPLs� In Proceedings of theThird Int�l Workshop on Database Programming Languages �DBPL�� pages ����Nafplion� Greece� �����

��� Bennet Vance� Towards an Object�Oriented Query Algebra� Technical Report������� Oregon Graduate Institute of Science ! Technology� January �����

��� David A� Watt and Phil Trinder� Towards a Theory of Bulk Types� TechnicalReport FIDE������� Computer Science Department� University of Glasgow� �����

��� Limsoon Wong� Normal Forms and Conservative Properties for Query Languagesover Collection Types� In Proceedings of the � th ACM Symposium on Principlesof Database Systems� pages ������ May �����