
Page 1: Algorithms - Portland State University


Algorithms and Applications for Universal Quantification in Relational Databases

Ralf Rantzau (a), Leonard D. Shapiro (b), Bernhard Mitschang (a) and Quan Wang (c)

(a) Computer Science Department, University of Stuttgart, Breitwiesenstr. 20-22, 70565 Stuttgart, Germany

(b) Computer Science Department, Portland State University, P.O. Box 751, Portland, OR 97201-0751, USA

(c) Oracle Corporation, Portland, Oregon, USA

Queries containing universal quantification are used in many applications, including business intelligence applications and in particular data mining. We present a comprehensive survey of the structure and performance of algorithms for universal quantification. We introduce a framework that results in a complete classification of input data for universal quantification. Then we go on to identify the most efficient algorithm for each such class. One of the input data classes has not been covered so far. For this class, we propose several new algorithms. Thus, for the first time, we are able to identify the optimal algorithm to use for any given input dataset.

These two classifications of optimal algorithms and input data are important for query optimization. They allow a query optimizer to make the best selection when optimizing at intermediate steps for the quantification problem.

In addition to the classification, we show the relationship between relational division and the set containment join, and we illustrate the usefulness of employing universal quantification by presenting a novel approach for frequent itemset discovery.

1. Introduction

Universal quantification is an important operation in the first-order predicate calculus. This calculus provides existential and universal quantifiers, represented by ∃ and ∀, respectively. A universal quantifier that is applied to a variable x of a formula f specifies that the formula is true for all values of x. We say that x is universally quantified in the formula f, and we write ∀x : f(x) in calculus.

In relational databases, universal quantification is implemented by the division operator (represented by ÷) of the relational algebra. The division operator is important for databases because it appears often in practice, particularly in business intelligence applications, including online analytic processing (OLAP) and data mining. In this paper, we will focus on the division operator exclusively.

Several algorithms have been proposed to implement relational division efficiently. These algorithms are presented in an isolated manner in the research literature; typically, no relationships are shown between them. Furthermore, each of these algorithms claims to be superior to others, but in fact each algorithm has optimal performance only for certain types of input data.

1.1. The Division Operator

To illustrate the division operator we will use a simple example throughout the paper, illustrated in Figure 1, representing data from a CS department at a university [8]. A course row represents a course that has been offered by the department, and an enrollment row indicates that a student has taken a particular course. The following query can be represented by the division operator:

"Which students have taken all courses offered by the department?"

As indicated in the table result, only Bob has taken all the courses. Bob is enrolled in another course (Graphics), but this does not affect the result. Both Alice and Chris are not enrolled in the Databases course. Therefore, they are not included in the result.

The division operator takes two tables for its input, the divisor and the dividend, and generates one table, the quotient. All the data elements in the divisor must appear in the dividend, paired with any element (such as Bob) that is to appear in the quotient.

In the example of Figure 1, the divisor and quotient have only one attribute each, but in general, they may have an arbitrary number of attributes. In any case, the set of attributes of the dividend is the disjoint union of the attributes of the divisor and the quotient. To simplify our exposition, we assume that the names of the dividend attributes are the same as the corresponding attribute names in the divisor and the quotient.
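To make the operator concrete, the division of Figure 1 can be computed in a few lines. This is an illustrative sketch of the operator's semantics, not one of the algorithms surveyed later; the function name and representation (tuples for rows) are our own.

```python
def divide(dividend, divisor):
    # Relational division: return every quotient value q such that
    # (q, d) appears in the dividend for every divisor value d.
    required = set(divisor)
    seen = {}  # quotient value -> set of divisor values paired with it
    for q, d in dividend:
        seen.setdefault(q, set()).add(d)
    return {q for q, ds in seen.items() if required <= ds}

enrollment = [("Alice", "Compilers"), ("Alice", "Theory"),
              ("Bob", "Compilers"), ("Bob", "Databases"),
              ("Bob", "Graphics"), ("Bob", "Theory"),
              ("Chris", "Compilers"), ("Chris", "Graphics"),
              ("Chris", "Theory")]
course = ["Compilers", "Databases", "Theory"]
print(divide(enrollment, course))  # {'Bob'}
```

Bob's extra Graphics row does not disturb the result, since the test only requires that the divisor be a subset of the courses paired with him.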

1.2. Outline of the Paper

The remainder of this paper is organized as follows. In Section 2, we present a classification of input data for algorithms that evaluate division within queries. Section 3 gives an overview of known and new algorithms to solve the universal quantification problem and classifies them according to two general approaches for division. In Section 4, we evaluate the algorithms according to both applicability and effectiveness for different kinds of input data, based on a performance analysis. In Section 5, we discuss the relationship between relational division and the set containment join. Section 6 illustrates a new approach to exploit division and set containment join to discover frequent itemsets. Section 7 gives an overview of related work. Section 8 concludes the paper and comments on future work.

enrollment
  student_id  course_id
  Alice       Compilers
  Alice       Theory
  Bob         Compilers
  Bob         Databases
  Bob         Graphics
  Bob         Theory
  Chris       Compilers
  Chris       Graphics
  Chris       Theory
(a) Dividend

course
  course_id
  Compilers
  Databases
  Theory
(b) Divisor

result
  student_id
  Bob
(c) Quotient

Figure 1. enrollment ÷ course = result, representing the query "Which students have taken all courses?"

2. Classification of Data

This section presents an overview of the input data for division. We identify all possible classes of data based on whether it is grouped on certain attributes. For some of these classes, we will present efficient algorithms in Section 3 that exploit the specific data properties of a class.

2.1. Input Data Characteristics

The goal of this paper is to identify optimal algorithms for the division operator, for all possible inputs. Several papers compare new algorithms to previous algorithms and claim superiority for one or more algorithms, but they do not address the issue of which algorithms are optimal for which types of data [3,4,8]. In fact, the performance of any algorithm depends on the structure of its input data.

If we know about the structure of the input data, we can employ an algorithm that exploits this structure, i.e., an algorithm that does not have to restructure the input before it can start generating output data. Of course, there is no guarantee that such an algorithm is always "better" than an algorithm that requires previous restructuring. However, the division operator offers a variety of alternative algorithms that can exploit such a structure for the sake of good performance and low memory consumption.

Suppose we are fortunate and the input data is highly structured. For example, suppose the data has the schema of Figure 1 but is of much larger size, and suppose:

- enrollment is sorted by student_id and course_id and resides on disk, and

- course is sorted by course_id and resides in memory.

Then the example query can be executed with one scan of the enrollment table. This is accomplished by reading the enrollment table from disk. As each student appears, the course_id values associated with that student are merged with the course table. If all courses match, the student_id is copied to the result.

The single scan of the enrollment table is obviously the most efficient possible algorithm in this case. In the remainder of this paper, we will describe similar types of structure for input datasets, and the optimal algorithms that are associated with them. The notion of "optimality" will be further discussed in the next section.

Revisiting our example in Figure 1, how could this careful structuring of input data, such as sorting by student_id and course_id, occur? It could happen by chance, or for two other more commonly encountered reasons:

1. The data might be stored in tables which were sorted in that order for other purposes, for example, so that it is easy to list enrollments on a roster in ID order, or to find course information when a course ID number is given.

2. The data might have undergone some previous processing, because the division operator query is part of a more complex query. The previous processing might have been a merge-join operator, for example, which requires that its inputs be sorted and produces sorted output data.
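The single-scan merge described above can be sketched as follows. This is a simplified sketch under the stated assumptions (dividend sorted by student_id and course_id, divisor sorted by course_id and held in memory); the function name is ours.

```python
from itertools import groupby

def merge_scan_division(enrollment_sorted, course_sorted):
    # One pass over the dividend: for each student, merge the student's
    # sorted course list against the sorted divisor. Extra courses
    # (e.g. Bob's "Graphics") are simply skipped during the merge.
    result = []
    for student, rows in groupby(enrollment_sorted, key=lambda r: r[0]):
        courses = iter(r[1] for r in rows)
        if all(any(c == want for c in courses) for want in course_sorted):
            result.append(student)
    return result

enrollment = [("Alice", "Compilers"), ("Alice", "Theory"),
              ("Bob", "Compilers"), ("Bob", "Databases"),
              ("Bob", "Graphics"), ("Bob", "Theory"),
              ("Chris", "Compilers"), ("Chris", "Graphics"),
              ("Chris", "Theory")]
course = ["Compilers", "Databases", "Theory"]
print(merge_scan_division(enrollment, course))  # ['Bob']
```

The shared iterator over a student's courses makes the inner check a merge: because both sequences are sorted the same way, each divisor value is sought only in the remaining, not-yet-consumed part of the student's course list.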

2.2. Choice of Algorithms

A query processor of a database system typically provides several algorithms that all realize the same operation. An optimizer has to choose one of these algorithms to process the given data. If the optimizer knows the structure of the input data for an operator, it can pick an algorithm that exploits the structure. Many criteria influence why one algorithm is preferred over others. Some of these choice criteria are: the time to deliver the first/last result row, the amount of memory for internal, temporary data structures, the number of scans over the input data, or the ability to be non-blocking, i.e., to return some result rows before the entire input data is consumed.

Which algorithm should we use to process the division operation, given the dividend and divisor tables shown in Figure 1? Several algorithms are applicable, but they are not equally efficient. For example, since the dividend and divisor are both sorted on the attribute course_id in Figure 1, we could select a division algorithm that exploits this fact by processing the input tuples in a way that is similar to the merge-join algorithm, as we have sketched in the previous section.

What algorithm should we select when the input tables are not sorted on course_id for each group of student_id? One option is to sort both input tables first and then employ the algorithm similar to merge-join. Of course, this incurs an additional computational cost for sorting in addition to the cost of the division algorithm itself. Another option is to employ an algorithm that is insensitive to the ordering of input tuples. One such well-known algorithm is hash-division, which is discussed in detail in Section 3.3.4.

We have seen that the decision which algorithm to select among a set of different division algorithms depends on the structure of the input data. This situation is true for any class of algorithms, including those that implement database operators such as join, aggregation, and sort.

It is possible that division is only a portion of a larger query that contains many additional query parts. Hence, the input of a division operation is not restricted to base tables, as in the example of Figure 1; it can also be derived tables which are the result of another operation, such as a join. Furthermore, the output of the division could be an intermediate result itself that is further processed within the query. For example, the quotient table result in Figure 1 could be the input of an aggregation that counts the number of students. The meaning of the resulting aggregate is the number of students who have taken all courses of the department. Alternatively, the result in Figure 1 could be an input of a join with a table student(student_id, name, address, ...) to retrieve a student's name, address, etc. instead of a meaningless ID. Thus, the result table produced by the selected division algorithm can have certain data properties that influence the choice of the additional algorithms, here a join, that are used to process the overall query.

2.3. Grouping

Relational database systems have the notion of grouped rows in a table. Let us briefly look at an example that shows why grouping is important for query processing. Suppose we want to find, for each course, the number of enrolled students in the enrollment table of Figure 1. One way to compute the aggregates involves grouping: after the table has been grouped on course_id, all rows of the table with the same value of course_id appear next to each other. The ordering of the group values is not specified, i.e., any group of rows may follow any other group. Group-based aggregation groups the data first, and then it scans the resulting table once and computes the aggregates during the scan.

Another way to process this query is nested-loop aggregation. We pick any course ID as the first group value and then search through the whole table to find the rows that match this ID and compute the count. Then, we pick a second course ID, search for matching rows, compute the second aggregate, pick the third value, etc. If no suitable search data structure (index) is available, this processing may involve multiple scans over the entire dataset.

The aggregation step of the group-based approach is obviously more efficient than the second approach because it can make an assumption about some ordering of the rows. However, the more efficient processing comes at the cost of the overhead of the preceding grouping.

When a table is to be grouped on a list (a1, ..., an) of more than one attribute, the result is equal to grouping on a single attribute in an iterative way: we first group on a1, then for each subset of rows defined by a1, we group on a2, and for each such subset determined by a2, we group on a3, etc. Hence, if we want to compare two tables that are grouped on the same set of attributes, we have to be aware of the attribute list ordering, because the resulting grouped table has a different structure for each ordering. This fact is important for division when we match some of the dividend's divisor attributes with all of the divisor's attributes.

Sorted data appears frequently in query processing. Note that sorting is a special grouping operation. For example, grouping only requires that students enrolled in the same course are stored next to each other (in any order), whereas sorting requires more effort, namely that they be in a particular order (ascending or descending). The overhead of sort-based grouping is reflected by the time complexity O(n log n), as opposed to the nearly linear time complexity of hash-based grouping. Though sort-based grouping algorithms do more than necessary, both hash-based and sort-based grouping perform well for large datasets [7,8].
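The near-linear behavior of hash-based grouping comes from touching each row exactly once. A minimal sketch of the course-count query from this section, using a hash table keyed on course_id (the helper name is ours):

```python
from collections import defaultdict

def hash_group_count(enrollment):
    # Hash-based group-and-count: one pass over the input; rows of the
    # same group need not be adjacent, so no prior grouping is required.
    counts = defaultdict(int)
    for _student, course_id in enrollment:
        counts[course_id] += 1
    return dict(counts)

enrollment = [("Alice", "Compilers"), ("Alice", "Theory"),
              ("Bob", "Compilers"), ("Bob", "Databases"),
              ("Bob", "Graphics"), ("Bob", "Theory"),
              ("Chris", "Compilers"), ("Chris", "Graphics"),
              ("Chris", "Theory")]
print(hash_group_count(enrollment))
```

By contrast, a sort-based variant would first pay O(n log n) to make equal course_id values adjacent, then count in one scan.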

2.4. Grouped Input Data for Division

Relational division has two input tables, a dividend and a divisor, and it returns a quotient table. As a consequence of the definition of the division operator, we can partition the attributes of the dividend S into two sets, which we denote D and Q, because they correspond to the attributes of the divisor and the quotient, respectively. The divisor's attributes correspond to D, i.e., for each attribute in the divisor there is a different attribute in D of the same domain. As already mentioned, for simplicity, we assume that the names of attributes in the quotient R are the same as the corresponding attribute names in the dividend S and the divisor T. Thus, we write a division operation as R(Q) = S(Q ∪ D) ÷ T(D). In Figure 1, Q = {student_id} and D = {course_id}.

Our classification of division algorithms is based on whether certain attributes are grouped or even sorted. Several reasons justify this decision. Grouped input can reduce the amount of memory needed by an algorithm to temporarily store rows of a table, because all rows of a group have a constant group value. Furthermore, grouping appears frequently in query processing. Many database operators require grouped or sorted input data (e.g., merge-join) or produce such output data (e.g., index-scan): if there is an index defined on a base table, a query processor can retrieve the rows in sorted order, specified by the index attribute list. Thus, in some situations algorithms may exploit, for the sake of efficiency, the fact that base tables or derived tables are grouped, if the system knows about this fact.

In Table 1, we show all possible classes of input data based on whether or not interesting attribute sets are grouped, i.e., grouped on one of Q, D, or the divisor. As we will see later in this paper, some classes have no suitable algorithm that can exploit their specific combination of data properties. The classes that have at least one algorithm exploiting exactly their data properties are shown in bold font. In class 0, for example, no table is grouped on an interesting attribute set. Algorithms for this class have to be insensitive to whether the data is grouped or not. Another example scenario is class 10. Here, the dividend is first grouped on the quotient attributes Q (denoted by G1, the major group) and, for each group, it is grouped on the divisor attributes D (denoted by G2, the minor group). The divisor is grouped in the same ordering (G2) as the dividend.

Our classification is based on grouping only. As we have seen, some algorithms may require that the input is even sorted and not merely grouped. We consider this a minor special case of our classification, so we do not reflect this data property in Table 1, but the algorithms in Section 3 will refer to this distinction. We do not consider any data property other than grouping in this paper because our approach is complete and can easily and effectively be exploited by a query optimizer and query processor.
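The grouped property itself is easy to state operationally: equal group values must be adjacent, while the order of the groups is irrelevant (sortedness is the special case where the groups also appear in a fixed order). A small hypothetical predicate, not from the paper, capturing this:

```python
def is_grouped(rows, key):
    # A table is grouped on `key` if all rows with equal key values are
    # adjacent; the ordering of the groups themselves does not matter.
    seen, prev = set(), object()
    for row in rows:
        k = key(row)
        if k != prev:
            if k in seen:
                return False  # this group value reappeared after a gap
            seen.add(k)
            prev = k
    return True

# The class-5 dividend of Figure 2(c): grouped on Q, not grouped on D.
class5 = [("Chris", "Graphics"), ("Chris", "Compilers"), ("Chris", "Theory"),
          ("Alice", "Theory"), ("Alice", "Compilers"),
          ("Bob", "Theory"), ("Bob", "Compilers"),
          ("Bob", "Databases"), ("Bob", "Graphics")]
print(is_grouped(class5, key=lambda r: r[0]))  # True
print(is_grouped(class5, key=lambda r: r[1]))  # False
```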

enrollment (student_id: not grouped; course_id: not grouped)
  Bob    Theory
  Alice  Compilers
  Chris  Theory
  Chris  Graphics
  Alice  Theory
  Bob    Graphics
  Chris  Compilers
  Bob    Databases
  Bob    Compilers

course (not grouped)
  Databases
  Theory
  Compilers

(a) Class 0

enrollment (student_id: not grouped; course_id: grouped)
  Alice  Theory
  Chris  Theory
  Bob    Theory
  Bob    Databases
  Bob    Graphics
  Chris  Graphics
  Bob    Compilers
  Chris  Compilers
  Alice  Compilers

course (not grouped)
  Databases
  Theory
  Compilers

(b) Class 2

enrollment (student_id: grouped; course_id: not grouped)
  Chris  Graphics
  Chris  Compilers
  Chris  Theory
  Alice  Theory
  Alice  Compilers
  Bob    Theory
  Bob    Compilers
  Bob    Databases
  Bob    Graphics

course (not grouped)
  Databases
  Theory
  Compilers

(c) Class 5

enrollment (student_id: grouped; course_id: grouped)
  Chris  Theory
  Chris  Graphics
  Chris  Compilers
  Alice  Theory
  Alice  Compilers
  Bob    Databases
  Bob    Theory
  Bob    Graphics
  Bob    Compilers

course (grouped)
  Databases
  Theory
  Compilers

(d) Class 10

Figure 2. Four important classes of input data, based on the example of Figure 1

Figure 2 illustrates four classes of input data for division, based on the example data of Figure 1. These classes, which are shown in bold font


Class  Dividend    Divisor  Description of Grouping
       Q    D
0      N    N      N
1      N    N      G
2      N    G      N
3      N    G1     G2       arbitrary ordering of groups in D and divisor
4      N    G1     G1       same ordering of groups in D and divisor
5      G    N      N
6      G    N      G
7      G1   G2     N        Q major, D minor
8      G2   G1     N        D major, Q minor
9      G1   G2     G3       Q major, D minor; arbitrary ordering of groups in D and divisor
10     G1   G2     G2       Q major, D minor; same ordering of groups in D and divisor
11     G2   G1     G3       D major, Q minor; arbitrary ordering of groups in D and divisor
12     G2   G1     G1       D major, Q minor; same ordering of groups in D and divisor

Table 1. A classification of dividend and divisor. Attributes are either grouped (G) or not grouped (N). We use the same (a different) subscript of G when D and the divisor have the same (a different) ordering of groups in classes 3, 4, 9-12. In addition, when the dividend is grouped on both Q and D in classes 7-12, G1 (G2) denotes the attributes that the table is grouped on first (second).

in Table 1, are important for several algorithms that we present in the following section. Notice that for class 10 both tables are grouped in the same order on course_id: if the value "Graphics" is present in a quotient group, then it always appears after "Theory" and before "Compilers." Figure 1 shows another example instance of class 10, where the quotient order as well as the divisor group order is ascending. The benefit of knowing about such an input data property will be clarified when we discuss algorithms exploiting this specific property in Sections 3.3.2 and 3.3.3.

If we know that an algorithm can process data

of a specific class, it is useful to know which other classes are also covered by the algorithm. This information can be represented, e.g., by a Boolean matrix like the one on the left in Figure 3. One axis indicates a given class C1 and the other axis shows the other classes C2 that are also covered by C1. Alternatively, we can use a directed acyclic graph representing the input data classification, sketched on the right of Figure 3. If a cell of the matrix is marked with "Y" (yes), or equivalently, if there is a path in the graph

[Figure 3: matrix and graph not reproduced here.]

Figure 3. A matrix and a directed acyclic graph representing the input data classification described in Table 1. All algorithms to be discussed in Section 3 assume data properties of either class 0, 2, 5, or 10.

from class C1 to C2, then an algorithm that can process data of class C1 can also process data of class C2. The graph clearly shows that the classification is a partial order of classes, not a strict hierarchy. The source node of the graph is class


Division Algorithm                              Abbrev.
Hash-Division                                   HD
Hash-Division for Divisor Groups                HDD
Hash-Division for Quotient Groups               HDQ
Merge-Count Division                            MCD
Merge-Group Division                            MGD
Merge-Sort Division                             MSD
Nested-Loops Division                           NLD
Nested-Loops Counting Division                  NLCD
Transposed Hash-Division                        HDT
Transposed Hash-Division for Divisor Groups     HDTD
Transposed Hash-Division for Quotient Groups    HDTQ
Stream-Join Division                            SJD

Table 2. Abbreviations for division algorithms

0, which requires no grouping of D, Q, or the divisor. Any algorithm that can process data of class 0 can process data of any other class. For example, an algorithm processing data of class 6 is able to process data of classes 9 and 10.

For the subsequent discussion of division algorithms, we define two terms to refer to certain row subsets of the dividend. Let the dividend S be grouped on Q (D) as the first or the only set of group attributes, i.e., let the dividend belong to class 5 (2) or any of its descendants in Figure 3. Furthermore, let v be one specific value of such a group. Then, the set of rows defined by σ_{Q=v}(S) (σ_{D=v}(S)) is called the quotient group (divisor group) of v. For example, in the enrollment table of class 5 in Figure 2(c), the quotient group of Alice consists of the rows {(Alice, Theory), (Alice, Compilers)}. Similarly, the divisor group of Databases in class 2 in Figure 2(b) consists of the single row (Bob, Databases).
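The two terms are plain selections; a small illustrative helper (names ours, rows as (Q, D) tuples) makes them concrete:

```python
def quotient_group(dividend, v):
    # sigma_{Q=v}(S): all dividend rows whose quotient value equals v
    return [row for row in dividend if row[0] == v]

def divisor_group(dividend, v):
    # sigma_{D=v}(S): all dividend rows whose divisor value equals v
    return [row for row in dividend if row[1] == v]

# Class-5 dividend of Figure 2(c), grouped on the quotient attribute.
class5 = [("Chris", "Graphics"), ("Chris", "Compilers"), ("Chris", "Theory"),
          ("Alice", "Theory"), ("Alice", "Compilers"),
          ("Bob", "Theory"), ("Bob", "Compilers"),
          ("Bob", "Databases"), ("Bob", "Graphics")]
print(quotient_group(class5, "Alice"))
# [('Alice', 'Theory'), ('Alice', 'Compilers')]
```

When the dividend is grouped on the corresponding attribute, such a group is a contiguous run of rows, which is exactly what the group-aware algorithms of Section 3 exploit.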

3. Overview of Algorithms

In this section, we present algorithms for relational division proposed in the database literature, together with several new variations of the well-known hash-division algorithm. For the sake of a concise presentation, we will frequently use the abbreviations for the algorithms that are summarized in Table 2.

In Section 4, we will analyze and compare the effectiveness of each algorithm with respect to the data classification of Section 2.

3.1. Complexity of Algorithms

During the evaluation of the relevant literature, we found it necessary to clarify that each division algorithm (analogous to other classes of algorithms, like joins, for example) has performance advantages for certain data characteristics. No algorithm is able to outperform the others for every conceivable input.

The following algorithms assume that the division's input consists of a dividend table S(Q, D) and a divisor table T(D), where Q is the set of quotient attributes and D is the set of divisor attributes, as defined in Section 2.4.

During the presentation of the algorithms, we analyze the worst-case and typical-case complexities of processing time and memory consumption in O-notation, based on the size (number of rows) of the dividend |S| and the size of the divisor |T|. We use |Q|, the number of distinct values of the quotient attributes Q in the dividend, for some algorithms to derive a complexity formula. Note that always |Q| ≤ |S|, and in the worst case |Q| = |S|, i.e., each single row of S is a potential (candidate) quotient. To derive formulas for the typical time and memory complexities, we use the assumption that |S| ≫ |T|, i.e., there are many quotient candidates and/or the number of rows of a typical quotient candidate is much larger than the number of divisor rows. We consider this situation the typical case because relational division is defined to compute a set of result rows, and in real-world scenarios this set is of considerable size. A large result size occurs only if the dividend contains many more rows than the divisor.

In addition to time and memory complexity, it is useful to analyze the I/O cost of each algorithm, as has been done in detail for some of the following algorithms in [8]. However, since the focus of this paper is to describe the fundamental structure of input data and algorithms involved in relational division, we restrict our analysis to memory and processing complexities and we do not give I/O formulas.

3.2. Query Language Representation and Algorithm Classification

In this section, we show SQL expressions for division and explain how they give rise to two classes of algorithms based on the kind of data structures employed.

The commonly used approach to express universal quantification uses two "NOT EXISTS" clauses, exploiting the mathematical equivalence ∀x∃y : f(x, y) ≡ ¬∃x¬∃y : f(x, y) as follows:

SELECT DISTINCT student_id
FROM enrollment AS e1
WHERE NOT EXISTS (
    SELECT *
    FROM course AS c
    WHERE NOT EXISTS (
        SELECT *
        FROM enrollment AS e2
        WHERE e2.student_id = e1.student_id AND
              e2.course_id = c.course_id))
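The double-negation pattern can be traced procedurally; the following sketch (names ours, outside SQL) mirrors the two NOT EXISTS clauses directly:

```python
def division_by_double_negation(enrollment, course):
    # "Students for whom there exists NO course in which they are NOT
    # enrolled" -- the NOT EXISTS / NOT EXISTS pattern of the query above.
    pairs = set(enrollment)
    students = {s for s, _ in enrollment}
    return {s for s in students
            if not any((s, c) not in pairs for c in course)}

enrollment = [("Alice", "Compilers"), ("Alice", "Theory"),
              ("Bob", "Compilers"), ("Bob", "Databases"),
              ("Bob", "Graphics"), ("Bob", "Theory"),
              ("Chris", "Compilers"), ("Chris", "Graphics"),
              ("Chris", "Theory")]
course = ["Compilers", "Databases", "Theory"]
print(division_by_double_negation(enrollment, course))  # {'Bob'}
```

The outer `not any(...)` plays the role of the outer NOT EXISTS; the membership test `(s, c) not in pairs` plays the role of the inner one.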

This query asks, for each student, whether there is no course that the student is not enrolled in.

The previous approach is not very intuitive to formulate. Another way to express division queries has been proposed in the past, using a special syntax for universal quantification. The quantifier "FOR ALL," which is part of a so-called quantified predicate [9], was planned to be included in the SQL:1999 standard [2], but it was finally excluded for reasons unknown to the authors. We can phrase division queries using the quantifier in an intuitive way. For example, the following SQL query employing a quantified predicate is equivalent to the above query:

SELECT DISTINCT student_id
FROM enrollment AS e1
WHERE FOR ALL (
    SELECT *
    FROM course AS c)
(EXISTS (
    SELECT *
    FROM enrollment AS e2
    WHERE e2.student_id = e1.student_id AND
          e2.course_id = c.course_id))

This query asks, for each student, whether for all courses there is an enrollment of this student.

A query language syntax dedicated to universal quantification allows us to map the query directly to a query execution that uses a division algorithm. It is nontrivial to map a query formulated in an indirect way (e.g., by using nested negations as in the first approach) to a query execution that uses a division algorithm.

There is a third way mentioned in the literature that uses aggregation. The example query of Section 1.1 can be phrased in SQL using aggregation as follows:

old enrollment
  student_id  course_id
  Chris       Compilers
  Chris       Graphics
  Chris       Theory
  Alice       Compilers
  Alice       Theory
  Bob         Compilers
  Bob         Databases
  Bob         Graphics
  Bob         Theory
(a) Original dividend

course
  course_id
  Databases
  Compilers
  Theory
(b) Divisor

new enrollment
  student_id  course_id
  Chris       Compilers
  Chris       Theory
  Alice       Compilers
  Alice       Theory
  Bob         Compilers
  Bob         Databases
  Bob         Theory
(c) Resulting dividend

Figure 4. Semi-join old enrollment ⋉ course = new enrollment, representing the preprocessing of the enrollment table for aggregate division algorithms, based on the example in Figure 1.

SELECT student_id
FROM enrollment
GROUP BY student_id
HAVING COUNT(DISTINCT course_id) = (
    SELECT COUNT(DISTINCT course_id)
    FROM course)

Any query involving universal quantification can be replaced by a query that makes use of counting [8]. However, there is a problem with this approach to express division, because it is not equivalent to the previous two approaches. It returns the same result as the other queries only if two conditions are met. First, each course_id (D) value in enrollment (the dividend) must also be contained in the course table (the divisor). Defining a foreign key enrollment.course_id that references course and enforcing referential integrity can fulfill this condition. Another way to guarantee referential integrity is to preprocess the dividend by a semi-join of dividend and divisor. The semi-join returns all dividend rows whose D values are contained in the divisor. Figure 4 illustrates the semi-join for our university example in Figure 1.
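The counting approach together with its semi-join preprocessing can be sketched as follows. This is our own simplified rendering, with sets standing in for DISTINCT:

```python
def aggregate_division(enrollment, course):
    divisor = set(course)  # duplicate divisor rows removed
    # Semi-join: keep only dividend rows whose D value is in the divisor,
    # and deduplicate by building a set of (Q, D) pairs.
    filtered = {(s, c) for s, c in enrollment if c in divisor}
    # COUNT(DISTINCT course_id) per student, compared with |divisor|.
    counts = {}
    for s, _c in filtered:
        counts[s] = counts.get(s, 0) + 1
    return {s for s, n in counts.items() if n == len(divisor)}

old_enrollment = [("Chris", "Compilers"), ("Chris", "Graphics"),
                  ("Chris", "Theory"), ("Alice", "Compilers"),
                  ("Alice", "Theory"), ("Bob", "Compilers"),
                  ("Bob", "Databases"), ("Bob", "Graphics"),
                  ("Bob", "Theory")]
course = ["Databases", "Compilers", "Theory"]
print(aggregate_division(old_enrollment, course))  # {'Bob'}
```

Without the semi-join, a student's rows for courses outside the divisor (such as Graphics) would inflate the count and could produce spurious quotient rows.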


The second condition of this approach requires that the course_id (D) values and the divisor rows be unique. Possible duplicates have to be removed before the division. Hence, the SQL query above contains the SQL keyword "DISTINCT" when counting course_id values, to avoid any duplicates. Note that when the divisor is grouped on all of its attributes, each group consists of a single row because of the required absence of duplicate rows. The same is true for the dividend if it is grouped on both Q and D, as in classes 7-12 in Table 1.

We have seen that the two approaches actually realize two logical operators that give rise to two classes of algorithms, aggregate and scalar. The scalar class of algorithms relies on direct row matches between the dividend's divisor attributes D and the divisor table. The second class, aggregate algorithms, uses counters to compare the number of rows in a dividend's quotient group to the number of divisor rows. In [3], scalar and aggregate algorithms are called direct and indirect algorithms, respectively.

Aggregate algorithms are often described as alternatives to scalar algorithms (for the real division operator), but they are prone to errors because one has to take care of duplicates, NULL values, and referential integrity, as already mentioned.

Some query languages for non-relational data models also offer support to express quantification. For example, there is work in progress by the W3C on the Working Draft of XQuery [28], a query language for XML data. Universal quantification can be expressed in XQuery by an "every" expression.

3.3. Scalar Algorithms

This section presents division algorithms that use data structures to directly match dividend rows with divisor rows.

3.3.1. Nested-Loops Division

This algorithm is the most naïve way to implement division. However, like nested-loops join, an operator using nested-loops division (NLD) requires no data properties of the input tables and thus can always be employed, i.e., NLD can process input data of class 0 and thus any other class of data, according to Figure 3.

We use two set data structures: one to store the set of divisor values of the divisor table, called seen divisors, and another to store the set of quotient candidate values that we have found so far in the dividend table, called seen quotients. We first scan the divisor table to fill seen divisors. After that, we scan the dividend in an outer loop. For each dividend row, we check if its quotient value (Q) is already contained in seen quotients. If not, we append it to the seen quotients data structure and scan the remainder of the dividend iteratively in an inner loop to find all rows that have the same quotient value as the dividend row of the outer loop. For each such row found, we check if its divisor value is in seen divisors. If yes, we mark the divisor value in seen divisors. After the inner scan is complete, we add the current quotient value to the output if all divisors in seen divisors are marked. Before we start processing the next dividend row of the outer loop, we unmark all elements of seen divisors.

Note that NLD can be very inefficient. For each row in the dividend table, we scan the dividend at least partially to find all the rows that belong to the current quotient candidate. All divisor rows and quotient candidate rows are stored in an in-memory data structure. NLD can be the most efficient algorithm for small ungrouped datasets.

This algorithm can make use of any set data structure, like hash tables or sorted lists, to represent seen divisors and seen quotients. Let us assume that this algorithm uses hash tables or another very efficient data structure with (nearly) constant access time. Then, the worst-case time complexity of this algorithm is O(|S|^2 + |T|) and the typical time complexity is O(|S|^2). The memory complexity is O(|Q| + |T|). Since in the extreme case |Q| = |S|, the worst-case memory complexity is O(|S| + |T|) and the typical memory complexity is O(|S|).

The pseudo code of the nested-loops division algorithm is shown in the Appendix. In that code, the seen divisors and seen quotients data structures are represented by the divisor hash table dht and the quotient hash table qht, respectively. Figure 5(g) illustrates the two hash tables

[Figure 5 omitted: panels (a) Nested-Loops Division (NLD), (b) Merge-Sort Division (MSD), (c) Merge-Group Division (MGD), (d) Hash-Division (HD), (e) Transposed Hash-Division (HDT), (f) Hash-Division for Quotient Groups (HDQ), and (g) Transposed Hash-Division for Quotient Groups (HDTQ).]

Figure 5. Overview of the data structures and processing used in scalar algorithms. The value setting is based on the example from Figure 1. Except for MSD and MGD, broken lined boxes indicate that a quotient is found.

used in this algorithm: the divisor/quotient hash table represents seen divisors/seen quotients, respectively. The value setting in the hash tables is shown for the time when all dividend rows of Alice and Bob (in this order) have been processed and we have not yet started to process any rows


of Chris in the outer loop. We find that Bob is a quotient because all bits in the divisor hash table are equal to 1.
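The matching logic just described can be sketched in Python. This is a minimal illustration, not the paper's pseudo code: the function name, the representation of tables as lists of (quotient value, divisor value) pairs, and the use of a dictionary for the marked bits are our own choices.

```python
def nested_loops_division(dividend, divisor):
    """Nested-loops division (NLD) over rows of (quotient value, divisor value).

    seen_divisors maps each divisor value to a 'marked' flag; seen_quotients
    records quotient candidates that were already checked.
    """
    seen_divisors = {d: False for d in divisor}   # filled from the divisor table
    seen_quotients = set()
    result = []
    rows = list(dividend)
    for i, (q, _) in enumerate(rows):             # outer scan of the dividend
        if q in seen_quotients:
            continue                              # candidate already handled
        seen_quotients.add(q)
        for q2, d2 in rows[i:]:                   # inner scan for this candidate
            if q2 == q and d2 in seen_divisors:
                seen_divisors[d2] = True          # mark matched divisor value
        if all(seen_divisors.values()):
            result.append(q)                      # every divisor value matched
        for d2 in seen_divisors:                  # unmark for the next candidate
            seen_divisors[d2] = False
    return result
```

On the running example (enrollment divided by course), only Bob is returned, since dividend rows without a divisor partner (Graphics) are simply ignored by the membership test on seen divisors.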

3.3.2. Merge-Sort Division

The merge-sort division (MSD) algorithm assumes that

• the divisor T is sorted, and that

• the dividend S is grouped on Q, and for each group, it is sorted on D in the same order (ascending or descending) as T.

This data characteristic is a special case of class 10, where D and the divisor are sorted and not only grouped.

The algorithm resembles merge-join for processing a single quotient group and is similar to nested-loops join for processing all groups. Let us briefly sketch the processing of rows within a single group, assuming an ascending sort order. We begin with the first row of dividend and divisor. If the divisor value D of the current dividend row and the divisor row match, we proceed with the next row in both tables. If D is greater than the current divisor row, we scan forward to the next quotient group. If D is less than the divisor row, we proceed with the next row of the group and the current divisor row. If there are no more rows to process in the quotient group but at least one more row in the divisor, we skip the quotient group. If there are no more rows to process in the divisor, we have found a quotient and add it to the output table.

Our merge-sort division is similar to the approach called naïve division, presented in [8] and originating from [26]. In both approaches, we can implement the scan of each input such that it ignores duplicates. In contrast to merge-sort division, naïve division explicitly sorts the data before the merge step. Even worse, naïve division does not merely group the dividend on Q but sorts it, which is more than necessary. Note that we view sorting or grouping as preprocessing activities that are separate from the core division algorithm. We sketch the pseudo code of merge-sort division without duplicate removal logic in the Appendix.
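The merge step can be sketched in Python as follows; as in the pseudo code, duplicate handling is omitted. The use of itertools.groupby to model the grouped input and all names are our own, not the paper's.

```python
from itertools import groupby
from operator import itemgetter

def merge_sort_division(dividend, divisor):
    """Merge-sort division (MSD): divisor sorted ascending; dividend grouped
    on the quotient attribute, with each group sorted ascending on D."""
    result = []
    for q, rows in groupby(dividend, key=itemgetter(0)):
        j = 0                                # cursor into the sorted divisor
        for _, d in rows:
            if j == len(divisor):
                break                        # all divisor values matched
            if d == divisor[j]:
                j += 1                       # match: advance both sides
            elif d > divisor[j]:
                j = -1                       # divisor[j] cannot appear anymore
                break                        # skip this quotient group
            # d < divisor[j]: advance within the group only
        if j == len(divisor):
            result.append(q)                 # group contains every divisor row
    return result
```

Note that breaking out of the inner loop is safe: groupby skips the unconsumed rest of a group when the next group is requested.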

The worst case time complexity of this algorithm is O(|S| + |Q||T|) = O(|S| + |S||T|) = O(|S||T|) because the dividend is scanned exactly once and from the divisor table, we fetch as many rows as the number of quotient candidates times the number of divisor rows. The typical case time complexity is O(|S||T|). The worst and typical case memory complexity is O(1), since only a constant number of small data structures (two rows) have to be kept in memory.

Figure 5(b) illustrates the matches between rows of dividend and divisor. Observe that the data is not sorted but only grouped on student_id in an arbitrary order.

3.3.3. Merge-Group Division

We can generalize merge-sort division to an algorithm that we call merge-group division (MGD). In contrast to MSD, we assume that

• both inputs are only grouped and not necessarily sorted on the divisor attributes, but that

• the order of groups in each quotient group and the order of groups in the divisor are the same.

Note that each group within a quotient group and within the divisor consists of a single row. This ordering can occur (or can be achieved) if, e.g., the same hash function is used for grouping the divisor and each quotient group.

In the MSD algorithm, we can safely skip a quotient candidate if the current value of D is greater (less) than that of the current divisor row, assuming an ascending (a descending) sort order. Since we do not require a sort order on these attributes in MGD, we cannot skip a group on unequal values, as we do in MSD. For example, suppose that the divisor T has a single integer attribute and consists of the following rows in the given order: T = (3, 1, 5), and the D values of the current quotient group G consist of the rows G = (2, 5, 4, 6). We can be sure that G is not a valid quotient only after

• we have scanned the entire group G, where we find that the first element of T (3) is not contained in G, or

• we have scanned T up to its last element (5) and we have scanned G up to its second element (5) to find that G does not contain the other elements of T (3 and 1) before element 5 appears.

The MGD approach makes use of a look-ahead of n divisor rows for some predefined value n ≥ 1. As in the MSD approach, we compare the current quotient group row with the current divisor row. In case of inequality, we look ahead up to the n-th divisor row to see if there is any other row matching the current group row. If we find such a match, we can skip the current quotient candidate. In our example, a look-ahead of 2 means that we check up to the second element (1) of the divisor. A look-ahead of 2 does not help for any value of G in our example. A look-ahead of 3 means a check with up to the third divisor element (5). When we check the second row (5) of the quotient group, we find a match with the third divisor element (5). Here, we can skip the group because a quotient would have to contain the values 3 and 1 before the occurrence of 5 to qualify, due to the assumption that the group orders are the same. In other words, the ordering assumption guarantees that the values 3 and 1 cannot occur after the element 5. Since they have neither occurred in G before element 5, we know that this quotient candidate does not contain all divisor elements, in particular not the elements 3 and 1.

The MSD algorithm is a special case of MGD where the look-ahead is set to 1: it does not look further than the current row for each quotient group row, since sorting was applied.

In summary, the MGD approach can make use of as much look-ahead as the minimum of the available memory and the current divisor size. Note that the divisor fits into memory in all reasonable cases. Figure 5(c) sketches the matches between dividend and divisor rows. Observe that the order of (single-row) groups within each quotient group in the dividend is the same as that of the divisor.

The time complexity of this algorithm is O(|S| + |Q||T|) because the dividend is scanned exactly once and the divisor is scanned entirely for each quotient and at least partially for every quotient candidate. Thus, the worst case time complexity is O(|S| + |S||T|) = O(|S||T|). The typical case time complexity is also O(|S||T|). The worst case memory complexity is O(|T|) if we keep the entire divisor as a look-ahead in memory. The typical case memory complexity then becomes O(1) since |T| ≪ |S|.
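Under the stated ordering assumption, the look-ahead logic can be sketched in Python as below. This is our own reading of the technique, not the paper's pseudo code: a match inside the look-ahead window, i.e., with a divisor row after the currently expected one, proves that an earlier divisor row was missed, so the candidate is skipped early.

```python
from itertools import groupby
from operator import itemgetter

def merge_group_division(dividend, divisor, lookahead=1):
    """Merge-group division (MGD): both inputs grouped (not sorted), with the
    same ordering inside every quotient group as in the divisor."""
    result = []
    for q, rows in groupby(dividend, key=itemgetter(0)):
        j = 0                  # number of divisor rows matched so far
        failed = False
        for _, d in rows:
            if failed or j == len(divisor):
                continue       # drain the rest of the group
            if d == divisor[j]:
                j += 1         # next expected divisor value found
            elif d in divisor[j + 1:j + lookahead]:
                failed = True  # a later divisor value appeared too early
        if not failed and j == len(divisor):
            result.append(q)
    return result
```

With the example above (T = (3, 1, 5), G = (2, 5, 4, 6)), a look-ahead of 3 detects at the group value 5 that the candidate cannot qualify; with a look-ahead of 1 the group is drained and rejected only by the final count check.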

3.3.4. Classic Hash-Division

In this section, we present the classic hash-division (HD) algorithm [8]. We call this algorithm "classic" to distinguish it from our variations of this approach in the following sections.

The two central data structures of HD are the divisor and quotient hash tables, sketched in Figure 5(d). The divisor hash table stores divisor rows. Each such row has an integer value, called its divisor number, stored together with it. The quotient hash table stores quotient candidates and has a bitmap stored together with each candidate, with one bit for each divisor. The pseudo code of hash-division is sketched in the Appendix.

In a first phase, hash-division builds the divisor hash table while scanning the divisor. The hash function takes the divisor attributes as an argument and assigns a hash bucket to each divisor row. A divisor row is stored into the hash bucket only if it is not already contained in the bucket, thus eliminating duplicates in the divisor. When a divisor row is stored, we assign a unique divisor number to it by copying the value of a global counter. This counter is initialized with zero and incremented for each stored divisor row. The divisor number is used as an index for the bitmaps of the quotient hash table.

The second phase of the algorithm constructs the quotient hash table while scanning the dividend. For each dividend row, we first check if its D value is contained in the divisor hash table, using the same hash function as before. If yes, we look up the associated divisor number; otherwise, we skip the dividend row. In addition to the look-up, we check if the quotient is already present in the quotient hash table. If yes, we update the bitmap associated with the matching quotient row by setting the bit to 1 whose position is equal to the divisor number we looked


up. Otherwise, we insert a new quotient row into the quotient hash table together with a bitmap where all bits are initialized with zeroes and the appropriate bit is set to 1, as described before. Since we insert only quotient candidates that are not already contained in the hash table, we avoid duplicate dividend rows.

The final phase of hash-division scans the quotient hash table's buckets and adds all quotient candidates to the output whose bitmaps contain only ones. In Figure 5(d), the contents of the hash tables are shown for the time when all dividend and divisor rows of Figure 1 have been processed. We see that since Bob's bitmap contains no zeroes, Bob is the only quotient, indicated by a broken lined box.

Hash-division scans both dividend and divisor exactly once. Because hash tables are employed that have a nearly constant access time, this approach has a worst and typical case time complexity of O(|S| + |T|) and O(|S|), respectively. The memory complexity consists of O(|T|) to store the divisor hash table plus O(|Q||T|) for the quotient hash table, since the size of a bitmap is proportional to |T|. Since the worst case scenario implies that |Q| = |S|, the total worst and typical case memory complexity is O(|S||T|).
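The three phases can be sketched compactly in Python, with dictionaries standing in for the hash tables; the function name and data layout are our own, not the paper's.

```python
def hash_division(dividend, divisor):
    """Classic hash-division (HD); no grouping assumed on either input."""
    # Phase 1: build the divisor hash table, numbering duplicate-free rows.
    divisor_ht = {}
    for d in divisor:
        if d not in divisor_ht:
            divisor_ht[d] = len(divisor_ht)       # divisor number
    # Phase 2: build the quotient hash table with one bitmap per candidate.
    quotient_ht = {}
    for q, d in dividend:
        if d not in divisor_ht:
            continue                              # no divisor partner: skip row
        bitmap = quotient_ht.setdefault(q, [0] * len(divisor_ht))
        bitmap[divisor_ht[d]] = 1                 # set bit for this divisor
    # Phase 3: output candidates whose bitmap contains only ones.
    return [q for q, bitmap in quotient_ht.items() if all(bitmap)]
```

Both inputs are scanned exactly once, and duplicate divisor rows and duplicate dividend rows are absorbed by the dictionary look-ups, mirroring the duplicate elimination described above.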

3.3.5. Transposed Hash-Division

This algorithm is a slight variation of classic hash-division. The idea is to switch the roles of the divisor and quotient hash tables. The transposed hash-division (HDT) algorithm keeps a bitmap together with each row in the divisor hash table instead of the quotient hash table, as in HD. Furthermore, HDT keeps an integer value with each row in the quotient hash table instead of the divisor hash table, as in the HD algorithm.

Like the classic hash-division algorithm, HDT first builds the divisor hash table. However, we store a bitmap with each row of the divisor. A value of 1 at a certain bit position of a bitmap indicates which quotient candidate has the same values of D as the given divisor row.

In a second phase, likewise the same as in HD, the HDT algorithm scans the dividend table and builds a quotient hash table. For each dividend row, the D values are inserted into the divisor hash table as follows. If there is a matching quotient row stored in the quotient hash table, we look up its quotient number. Otherwise, we insert a new quotient row together with a new quotient number. Then, we update the divisor row's bitmap by setting the bit at the position given by the quotient number to 1.

The final phase makes use of a new, separate bitmap, whose size is the same as that of the bitmaps in the divisor hash table. All bits of this bitmap are initialized with zero. While scanning the divisor hash table, we apply a bit-wise AND operation between each bitmap contained and the new bitmap. The resulting bit pattern of the new bitmap is used to identify the quotients. The quotient numbers (bit positions) with a value of 1 are then used to look up the quotients using a quotient vector data structure that allows a fast mapping of a quotient number to a quotient candidate. The HDT pseudo code is shown in the Appendix.

Figures 5(d) and (e) contrast the different structure of hash tables in HD and HDT. The hash table contents are shown for the time when all enrollment rows of Figure 1 have been processed. While a quotient in the HD algorithm can be added to the output when the associated bitmap contains no zeroes, the HDT algorithm requires a match of the bit at the same position of all bitmaps in the divisor table and, in addition, a look-up in the quotient hash table to find the associated quotient row.

The time and memory complexities of HDT are the same as those of classic hash-division.
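The transposed layout can be sketched in Python as follows. This is our own illustration: growable lists stand in for the per-divisor bitmaps, and the final AND is expressed as a per-position all() over the divisor bitmaps.

```python
def transposed_hash_division(dividend, divisor):
    """Transposed hash-division (HDT): bitmaps live with the divisor rows,
    quotient numbers with the quotient rows."""
    divisor_ht = {d: [] for d in divisor}      # divisor row -> growable bitmap
    quotient_ht = {}                           # quotient row -> quotient number
    quotient_vector = []                       # quotient number -> quotient row
    for q, d in dividend:
        if d not in divisor_ht:
            continue                           # skip rows without divisor partner
        if q not in quotient_ht:
            quotient_ht[q] = len(quotient_vector)
            quotient_vector.append(q)
        i = quotient_ht[q]
        bitmap = divisor_ht[d]
        if len(bitmap) <= i:                   # grow bitmap up to candidate i
            bitmap.extend([0] * (i + 1 - len(bitmap)))
        bitmap[i] = 1                          # candidate i covers divisor d
    # Final phase: AND all divisor bitmaps; surviving positions are quotients.
    return [quotient_vector[i]
            for i in range(len(quotient_vector))
            if all(len(bm) > i and bm[i] for bm in divisor_ht.values())]
```

The quotient_vector plays the role of the quotient vector described above: it maps a surviving bit position back to the quotient row.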

3.3.6. Hash-Division for Quotient Groups

Both classic and transposed hash-division can be improved if the dividend is grouped on either D or Q. However, our optimizations based on divisor groups lead to aggregate, not scalar algorithms. Hence, this section on scalar algorithms presents only the optimizations for quotient groups. The optimizations of hash-division for divisor groups are presented in Section 3.4.3.

Let us first focus on classic hash-division. If the dividend is grouped on Q, we do not need a quotient hash table. It suffices to keep a single bitmap to check if the current quotient candidate


is actually a quotient. When all dividend rows of a quotient group have been processed and all bits of the bitmap are equal to 1, the quotient row is added to the output. Otherwise, we reset all bits to zero, skip the current quotient row, and continue processing the next quotient candidate. Because of the group-by-group processing of the improved algorithm, we call this approach hash-division for quotient groups (HDQ).

The HDQ algorithm is non-blocking because we return a quotient row to the output as soon as a group of (typically few) dividend rows has been processed. In contrast, the HD algorithm has a final output phase: the quotient rows are added to the result table only after the entire dividend has been processed, because hash-division does not assume a grouping on Q. For example, the "first" and the "last" row of the dividend could belong to the same quotient candidate, hence the HD algorithm has to keep the state of the candidate quotient row as long as at least one bit of the candidate's bitmap is equal to zero. Note that it is possible to enhance HD such that it is not a "fully" blocking algorithm. If bitmaps are checked during the processing of the input, HD could detect some quotients that can be returned to the output before the entire dividend has been scanned. Of course, we would then have to make sure that no duplicate quotients are created, either by preprocessing, by referential integrity enforcements, or by keeping the quotient value in the hash table until the end of the processing. In this paper, we do not elaborate on this variation of HD.
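A minimal Python sketch of HDQ follows; groupby models the dividend grouped on Q, and the names are ours. Note how one bitmap is reused across groups and each quotient is emitted as soon as its group ends.

```python
from itertools import groupby
from operator import itemgetter

def hash_division_quotient_groups(dividend, divisor):
    """HDQ: dividend grouped on Q, so a single reusable bitmap suffices and
    each quotient can be emitted as soon as its group has been consumed."""
    divisor_ht = {}
    for d in divisor:
        if d not in divisor_ht:
            divisor_ht[d] = len(divisor_ht)    # divisor number
    result = []
    for q, rows in groupby(dividend, key=itemgetter(0)):
        bitmap = [0] * len(divisor_ht)         # fresh (reset) bitmap per group
        for _, d in rows:
            if d in divisor_ht:
                bitmap[divisor_ht[d]] = 1
        if all(bitmap):
            result.append(q)                   # non-blocking: output per group
    return result
```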

3.3.7. Transposed Hash-Division for Quotient Groups

We have seen that the HDQ algorithm is a variation of the HD algorithm: if the dividend is grouped on Q, we can do without a quotient hash table. Exactly the same idea can be applied to HDT, yielding an algorithm that we call transposed hash-division for quotient groups (HDTQ).

For grouped quotient attributes, we can do without the quotient hash table, and we do not keep long bitmaps in the divisor hash table but only a single bit per divisor. Before any group is processed, the bit of each divisor row is set to zero. For each group, we process the rows like in the HDT algorithm. After a group is processed, we add a quotient to the output if the bit of every divisor row is equal to 1. Then, we reset all bits to zero and resume the dividend scan with the next group.

We do not show the pseudo code for the HDQ and HDTQ algorithms for brevity. However, we sketch their data structures in Figures 5(f) and (g) for the time when the group of dividend rows containing the quotient candidate Bob has been processed.

3.4. Aggregate Algorithms

This class of algorithms compares the number of rows in each quotient candidate with the number of divisor rows. In case of equality, a quotient candidate becomes a quotient. All algorithms have in common that, in a first phase, the divisor table is scanned once to count the number of divisor rows. Each algorithm then uses different data structures to keep track of the number of rows in a quotient candidate. Some algorithms assume that the dividend is grouped on Q or D.

3.4.1. Nested-Loops Counting Division

Similar to scalar nested-loops division, nested-loops counting division (NLCD) is the most naïve algorithm in the class of aggregate algorithms. This algorithm scans the dividend multiple times. During each scan, NLCD counts the number of rows belonging to the same quotient candidate.

We have to keep track of which quotient candidates we have already checked, using a quotient hash table as shown in Figure 6(a). A global counter is used to keep track of the number of dividend rows belonging to the same quotient candidate. We fully scan the dividend in an outer loop: we pick the first dividend row, insert its Q value into the quotient hash table, and set the counter to 1. If the counter's value is equal to the divisor count, we add the quotient to the output and continue with the next row of the outer loop. Otherwise, we scan the dividend in an inner loop for rows with the same Q value as the current quotient candidate. For each such row, the counter is incremented and checked, and in case of equality, the quotient is added to the output. When the end of the dividend is reached in the inner loop, we

[Figure 6 omitted: panels (a) Nested-Loops Counting Division (NLCD), (b) Merge-Count Division (MCD), (c) Hash-Division for Divisor Groups (HDD), (d) Transposed Hash-Division for Divisor Groups (HDTD), and (e) Stream-Join Division (SJD).]

Figure 6. Overview of data structures used in aggregate algorithms. Broken lined boxes indicate that a quotient is found. Only Bob's group has as many dividend rows as the divisor.

continue with the next row of the outer loop and check the hash table to see if this new row is a new quotient candidate.

The time and memory complexities are the same as for nested-loops division.
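The counting variant can be sketched in Python as below (our own names and data layout). It assumes, as all aggregate algorithms do, a duplicate-free dividend with referential integrity, so a candidate is a quotient exactly when its row count equals the divisor's row count.

```python
def nested_loops_counting_division(dividend, divisor_count):
    """NLCD: aggregate counterpart of NLD. No grouping is assumed; the
    dividend is scanned repeatedly to count each candidate's rows."""
    rows = list(dividend)
    checked = set()                            # quotient hash table
    result = []
    for i, (q, _) in enumerate(rows):
        if q in checked:
            continue                           # candidate counted before
        checked.add(q)
        count = sum(1 for q2, _ in rows[i:] if q2 == q)  # inner scan
        if count == divisor_count:
            result.append(q)
    return result
```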

3.4.2. Merge-Count Division

Assuming that the dividend is grouped on Q, merge-count division (MCD) scans the dividend exactly once. After a quotient candidate has been processed and its number of rows is equal to that of the divisor, the quotient is added to the output. Note that the size of a quotient group cannot exceed the number of divisor rows because we have to guarantee referential integrity.

The aggregate algorithm merge-count division is similar to the scalar algorithms MSD and MGD, described in Sections 3.3.2 and 3.3.3. Instead of comparing the elements of quotient groups with the divisor, MCD uses a representative (the row count) of each quotient group to compare it with the divisor's aggregate. Figure 6(b) illustrates the single scan required to compare the size of each quotient group with the divisor size.

MCD has a worst case time complexity of


O(|S| + |T|) and a typical case time complexity of O(|S|). Since no significant data structures have to be kept in memory except for the current dividend row and the counters, the worst case and typical case memory complexity is O(1).
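MCD reduces to a single pass of per-group counting. A minimal Python sketch (names ours), assuming duplicate-free inputs with referential integrity as stated above:

```python
from itertools import groupby
from operator import itemgetter

def merge_count_division(dividend, divisor):
    """MCD: dividend grouped on Q; a group is a quotient iff its row count
    equals the divisor's row count."""
    divisor_count = len(divisor)               # first phase: count the divisor
    return [q for q, rows in groupby(dividend, key=itemgetter(0))
            if sum(1 for _ in rows) == divisor_count]
```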

3.4.3. Hash-Division for Divisor Groups

In Section 3.3.6, we analyzed optimizations of hash-division that require a dividend that is grouped on Q. We now show some optimizations of hash-division for a dividend that is grouped on D. Unlike the hash-division-like algorithms based on quotient groups, the following two algorithms are blocking.

The first algorithm does not need a divisor hash table because after a divisor group of the dividend has been consumed, the divisor value will never reappear. We use a counter instead of a bitmap for each row in the quotient hash table. We call this adaptation of the HD algorithm hash-division for divisor groups (HDD). The algorithm maintains a global counter to count the number of divisor groups seen so far in the dividend. For each dividend row of a divisor group, we increment the counter of the quotient candidate. If the quotient candidate is not yet contained in the quotient hash table, we insert it together with a counter set to 1. When the entire dividend has been processed, we return those quotient candidates in the quotient hash table whose counter is equal to the global counter.
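Following that description, HDD can be sketched in a few lines of Python (names ours). As described, the final comparison is against the number of divisor groups seen in the dividend, which presumes every divisor value actually occurs there.

```python
from itertools import groupby
from operator import itemgetter

def hash_division_divisor_groups(dividend):
    """HDD: dividend grouped on D, duplicate-free, with referential integrity.
    No divisor hash table: the number of divisor groups is counted instead."""
    quotient_ht = {}                       # quotient candidate -> counter
    group_count = 0                        # global divisor-group counter
    for _, rows in groupby(dividend, key=itemgetter(1)):
        group_count += 1
        for q, _ in rows:
            quotient_ht[q] = quotient_ht.get(q, 0) + 1
    return [q for q, c in quotient_ht.items() if c == group_count]
```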

3.4.4. Transposed Hash-Division for Divisor Groups

The last algorithmic adaptation that we present is called transposed hash-division for divisor groups (HDTD), based on the HDT algorithm. We can do without a divisor hash table, but we keep an array of counters during the scan of the dividend. The processing is basically the same as in the previous algorithm (HDD): we return only those quotient candidates of the quotient hash table whose counter is equal to the value of the global counter. Because all divisor groups have to be processed before we know all quotients, this algorithm is also blocking.

We do not show the pseudo code for the HDD and HDTD algorithms for brevity. However, we sketch the data structures used in Figures 6(c) and (d) for the time when the entire dividend has been processed. Note that the dividend contains only three divisor groups (no Graphics rows), because we require that referential integrity between enrollment and course is preserved, e.g., by applying a semi-join of the two tables before division, as in Figure 4. Bob is the only student who is contained in all three divisor groups.

The complexities of HDD and HDTD are the same. Their worst and typical case time complexity is O(|S| + |T|) and O(|S|), respectively. The worst and typical case memory complexity is O(|S|).

3.4.5. Stream-Join Division

The new algorithm stream-join division (SJD) [20] is an improvement of hash-division for divisor groups (HDD). Like all other algorithms assuming a dividend that is grouped on D as the only or the major set of group attributes, SJD is a blocking algorithm. SJD is a hybrid because it counts the number of divisor rows, like all other aggregate algorithms, and it maintains several bits to memorize matches between dividend and divisor, like all scalar algorithms. However, in this paper, we consider SJD an aggregate algorithm due to its similarity to HDD.

The major differences between SJD and HDD are:

• SJD stores a bit instead of a counter together with each quotient candidate in the quotient hash table.

• SJD is able to remove quotient candidates from the quotient hash table before the end of the processing.

The SJD algorithm works as follows. As in HDD, we maintain a counter to count the number of divisor groups seen so far in the dividend. First, we insert all quotient candidates, i.e., Q values, of the first group in the dividend, together with a bit initialized with zero, into the quotient hash table. We thereby eliminate possible duplicates in the dividend. Then, we process each following group as follows. For each dividend row of the current group, we look up the quotient candidate in the quotient hash table. In case of a


match, the corresponding bit is set to 1. Otherwise, i.e., when the Q value of a given dividend row is not present in the quotient hash table, we skip this row. After a group has been processed, we remove all quotient candidates with a bit equal to zero. Then, we reset the bit of each remaining quotient candidate to zero. Finally, when all groups have been processed, we compare the current group counter with the number of rows in the divisor. In case of equality, all quotient candidates in the quotient hash table with a bit equal to 1 are added to the output.

Figure 6(e) illustrates the use of the quotient hash table in SJD. We assume that the dividend is equal to the enrollment table of class 2 in Figure 2(b), with the exception that the Graphics group {(Bob, Graphics), (Chris, Graphics)} is missing due to referential integrity. We show the contents of the hash table for the time when the entire enrollment table has been processed. We see that Chris and Alice are not contained in the hash table because both have already been eliminated after the second group (Databases). Only Bob's bit is set to 1, and he is a quotient row because the number of groups (3, without Graphics) is equal to the number of divisor rows.

The advantage of SJD lies in the fact that the amount of memory can decrease but will never increase after the quotient candidates have been stored in the quotient hash table. However, the time and memory complexity is the same as for HDD. Observe that the maximum amount of memory required is proportional to the number of rows of the first group in the dividend. It may happen by chance that the first group is the smallest of the entire dividend. In this case, we obtain a very memory-efficient processing.

This algorithm is called stream-join division because it joins all divisor groups of the dividend (called streams in [20]) with each other on the attributes Q.
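The early-removal behavior can be sketched in Python as follows (names and data layout ours; the divisor's row count is passed in, matching the final comparison described above):

```python
from itertools import groupby
from operator import itemgetter

def stream_join_division(dividend, divisor_count):
    """SJD: dividend grouped on D with referential integrity. Candidates that
    miss any divisor group are removed immediately, so memory never grows
    after the first group has been stored."""
    groups = groupby(dividend, key=itemgetter(1))
    first = next(groups, None)
    if first is None:
        return []
    quotient_ht = {q: 0 for q, _ in first[1]}  # candidates of the first group
    group_count = 1
    for _, rows in groups:
        group_count += 1
        for q, _ in rows:
            if q in quotient_ht:
                quotient_ht[q] = 1             # candidate survives this group
        # drop candidates that missed the group, reset bits for the next one
        quotient_ht = {q: 0 for q, bit in quotient_ht.items() if bit}
    # all groups seen: check against the divisor's row count
    return list(quotient_ht) if group_count == divisor_count else []
```

Rebuilding the dictionary after each group implements both the removal of failed candidates and the bit reset in one step.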

4. Evaluation of Algorithms

In this section, we briefly compare the division algorithms discussed in Section 3 with each other and show which algorithm is optimal, with respect to time and memory complexities, for each class of input data discussed in Section 2.

Table 3 characterizes the algorithms presented so far and shows the time and memory complexities involved. We assigned the algorithms to those data classes that have the least restrictions with respect to grouping. Remember that an algorithm of class C can also process data of classes that are reachable from C in the dependency graph in Figure 3. The overview of division algorithms in Table 3 shows that, despite the detailed classification in Table 1 (comprising 13 classes and enumerating all possible kinds of input data), there are four major classes of input data that are covered by dedicated division algorithms:

• class 0, which makes no assumption about grouping,

• class 2, which covers dividends that are grouped only or first on D,

• class 5, which covers dividends that are grouped only or first on Q, and finally

• class 10, which specializes class 5 (and class 0, of course) by requiring that for each quotient group, the rows of D and the divisor appear in the same order. Hence, the dividend is grouped on Q as major and D as minor.

Note that algorithms for class 2, namely HDD, HDTD, and SJD, have not been identified in the literature so far. They represent a new, straightforward approach to deal with a dividend that is grouped on D. Together with the other three major classes, a query optimizer can exploit the information on the input data properties to make an optimal choice of a specific division operator.

Suppose we are given input data of a class that is different from the four major classes. Which algorithms are applicable to process our data? According to the graph in Figure 3, all algorithms belonging to major classes whose nodes are direct or indirect parents of the given class can be used. For example, any algorithm of major classes 0 and 5 can process data of the non-major classes 6, 7, and 9.


Division    Algorithm   Data    Dividend S     Divisor   Time complexity             Memory complexity
algorithm   class       class   Q      D       T         worst         typical       worst        typical
---------------------------------------------------------------------------------------------------------
NLCD        aggregate   0       N      N       N         |S|² + |T|    |S|²          1            1
NLD         scalar      0       N      N       N         |S|² + |T|    |S|²          |S| + |T|    |S|
HD          scalar      0       N      N       N         |S| + |T|     |S|           |S||T|       |S||T|
HDT         scalar      0       N      N       N         |S| + |T|     |S|           |S||T|       |S||T|
HDD         aggregate   2       N      G       N         |S| + |T|     |S|           |S|          |S|
HDTD        aggregate   2       N      G       N         |S| + |T|     |S|           |S|          |S|
SJD         aggregate   2       N      G       N         |S| + |T|     |S|           |S|          |S|
MCD         aggregate   5       G      N       N         |S| + |T|     |S|           1            1
HDQ         scalar      5       G      N       N         |S| + |T|     |S|           |T|          1
HDTQ        scalar      5       G      N       N         |S| + |T|     |S|           |T|          1
MGD         scalar      10      G1     G2      G2        |S||T|        |S||T|        |T|          1
MSD         scalar      10      G1     S2      S2        |S||T|        |S||T|        1            1

Table 3. Overview of division algorithms showing, for each algorithm, the class of required input data, its algorithm class, and its time and memory complexities (in O-notation). Input data are either not grouped (N), grouped (G), or sorted (S). Class 10 is first grouped on Q, indicated by G1. For each quotient group, it is grouped (G2) or sorted (S2) on D in the same order as the divisor. The algorithm names corresponding to the abbreviations in the first column are given in Table 2.

Several algorithms belong to each class of input data in Table 3. In class 0, both HD and HDT have a linear time complexity (more precisely, nearly linear due to hash collisions). However, they have a higher memory complexity than the other algorithms of this class, NLCD and NLD.

We have designed three aggregate algorithms for class 2. They all have the same linear time and memory complexities.

Class 5 has two scalar and one aggregate algorithm assigned to it, which all have the same time complexity. The constant worst case memory complexity of MCD is the lowest of the three.

The two scalar algorithms MGD and MSD of class 10, which consists of two subgroups (sorted and grouped divisor values), have the same time complexity. The worst case memory complexity of MSD is lower than that of MGD because MSD can exploit the sort order.

It is important to observe that one should not

directly compare complexities of scalar and ag-gregate algorithms in Table 3 to determine themost eÆcient algorithm overall. This is becauseaggregate algorithms require duplicate-free inputtables, which can incur a very costly preprocess-

ing step. There is one exception of aggregate al-gorithms: SJD ignores duplicate dividend rowsbecause of the hash table used to store quotientcandidates. It does not matter if a quotient oc-curs more than once inside a divisor group be-cause the bit corresponding to a quotient candi-date can be set to 1 any number of times withoutchanging its value (1). However, SJD does notignore duplicates in the divisor because it countsthe number of divisor rows.In general, scalar division algorithms ignore du-

plicates in the dividend and the divisor. Notethat the scan operations of MGD and MSD canbe implemented in such a way that they ignoreduplicates in both inputs [8]. However, to sim-plify our presentation, the pseudo code of MSDin the Appendix does not ignore duplicates.Let us brie y illustrate some example issues

that we have to take into account when compar-ing division algorithms. The �rst issue is timeversus memory complexity. In class 0, for exam-ple, four algorithms have been identi�ed. NLCDand NLD have a quadratic time complexity com-pared to the linear complexities of HD and HDT.Despite the di�erent processing performance of


these algorithms, a query optimizer may prefer to pick a division operator based on the NLCD algorithm over HD and HDT if the estimated amount of input data is small and the optimizer wants to avoid the overhead of building hash tables. We do not go into the details of query optimization here because, in general, the choice of a specific operator from a set of logically equivalent operators (like join and division) also depends on factors other than time and memory complexity, as we have mentioned in Section 2.2. Nevertheless, time and memory consumption are the dominant factors in practice.

The second issue concerns the efficiency of a query processor for certain operations. We presented two different approaches to hash-division: the classic approach (HD), where bitmaps are stored together with quotient candidates in the quotient hash table, and a new approach (HDT), where bitmaps are stored with each divisor row in the divisor hash table (see Figures 5(d) and (e) for illustrations). These dual approaches may seem interchangeable at first sight with respect to efficiency. However, in some situations a query optimizer may prefer one to the other, depending on how efficiently the system processes bitmaps. Suppose the system can process a few extremely long bitmaps more efficiently than many short bitmaps. If there are many quotient candidates in the input data (which is typical) but the divisor is relatively short, then the bitmaps stored by HD are relatively short but there are many of them. In contrast, HDT would build very long bitmaps (which may be the deciding factor), but only a few of them would be stored in the divisor hash table. Analogously, the optimizer may prefer HD to HDT if the input consists of few but very large quotient candidates. Similar considerations apply to the other pairs of transposed and non-transposed algorithms, i.e., to the HDD/HDTD and HDQ/HDTQ pairs.
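The classic hash-division scheme discussed above can be sketched in a few lines. The following Python sketch is our own illustration, not the paper's pseudo code: the function name `hash_division` and the row layout (a sequence of (quotient_attr, divisor_attr) pairs) are assumptions, and the bitmap is a plain Python list rather than a packed bit vector.

```python
def hash_division(dividend, divisor):
    """Sketch of classic hash-division (HD). dividend: iterable of
    (quotient_attr, divisor_attr) pairs; divisor: iterable of values."""
    # Divisor hash table: each distinct divisor value gets a bit position.
    divisor_table = {}
    for value in divisor:
        divisor_table.setdefault(value, len(divisor_table))
    n = len(divisor_table)
    # Quotient hash table: candidate -> bitmap with one bit per divisor value.
    candidates = {}
    for q, d in dividend:
        bit = divisor_table.get(d)
        if bit is None:
            continue  # dividend row matches no divisor row: skip it
        bitmap = candidates.setdefault(q, [False] * n)
        bitmap[bit] = True  # setting a bit twice is harmless (duplicates)
    # A candidate is a quotient iff every bit of its bitmap is set.
    return {q for q, bitmap in candidates.items() if all(bitmap)}
```

Note how the bitmap update makes the sketch insensitive to duplicate dividend rows, the same property exploited by SJD above.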

5. The Set Containment Problem

Universal quantification checks if all elements of a given set fulfill a given condition. In many applications, this condition is a set element membership test, i.e., the quantification problem becomes a set containment problem. For example, the problem stated in Section 1.1 can be rephrased as follows: "Find the students whose associated set of enrolled courses contains the given set of courses offered by the department."

5.1. Set Storage Representations

Division is an operator of the relational algebra, which is based on the relational model. In the basic relational model, all relations are in first normal form (1NF), i.e., all attribute domains are atomic. One possible extension of the relational model provides relations with multivalued attributes, where the attribute domain is a collection type like bag or set, defined on top of a primitive domain like float or string. A more rigorous extension of the relational model is the nested relational model [14,16], where attributes can be relations themselves.

There are basically two orthogonal classifications for the storage representation of sets: nesting and location [21]. The nested representation stores the set as a single variable-length attribute value, while the unnested representation stores the elements as multiple tuples.

In a classification based on the storage location, one can distinguish between an internal representation, where the set elements are stored together with the accompanying attribute values, and an external representation, where the set elements are stored in a separate auxiliary table connected by foreign key references, as depicted in Figure 7, according to [21]. In this figure, we show as an example a single tuple of the relation enrollment(student_id, courses), where student_id is an atomic attribute and courses is a set-valued attribute. Here, we represent the fact that the student Chris is enrolled in the courses Compilers, Graphics, and Theory. Only the unnested internal representation conforms to 1NF.
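To make the four layouts of Figure 7 concrete, here is a small, purely illustrative Python transcription of the enrollment example. All variable names and the set ID 7 are our own inventions, not part of the paper:

```python
# Nested internal: the set is stored inline with the tuple.
nested_internal = [("Chris", {"Compilers", "Graphics", "Theory"})]

# Unnested internal (the only 1NF layout): one row per set element.
unnested_internal = [("Chris", "Compilers"),
                     ("Chris", "Graphics"),
                     ("Chris", "Theory")]

# Nested external: the tuple holds a foreign key into a set table.
students = [("Chris", 7)]  # 7 is a hypothetical set ID
set_table = {7: {"Compilers", "Graphics", "Theory"}}

# Unnested external: the auxiliary table is itself in 1NF.
elements = [(7, "Compilers"), (7, "Graphics"), (7, "Theory")]

# All four encode the same fact; e.g., rebuilding the set from the 1NF rows:
rebuilt = {course for sid, course in unnested_internal if sid == "Chris"}
```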

5.2. Set Containment Join and Relational Division

The set containment problem has been studied in great detail in the past [10,17-19,21,22]. In particular, several efficient set containment test algorithms have been developed, and storage data


(a) Nested internal:

    Chris | {Compilers, Graphics, Theory}

(b) Unnested internal:

    Chris | Compilers
    Chris | Graphics
    Chris | Theory

(c) Nested external:

    Chris | (set ref)        (set ref) | {Compilers, Graphics, Theory}

(d) Unnested external:

    Chris | (set ref)        (set ref) | Compilers
                             (set ref) | Graphics
                             (set ref) | Theory

Figure 7. Storage representations of set-valued attributes.

structures to represent sets in relational, object-relational, and object-oriented databases are discussed.

It is interesting to observe that the division operator is closely related to the set containment join, which can be implemented efficiently [17,21]. The set containment join (SCJ), denoted by ⋈⊆, is a join between the set-valued attributes b and c of the two relations R(a, b) and S(c, d):

S ⋈_{c⊆b} R = { t | t ∈ S × R ∧ c ⊆ b }.

Figure 8 illustrates an example computation of the set containment join based on the scenario introduced in Section 1.1. Only the table course has been changed, by adding an additional attribute program that indicates which combination of CS courses is required for a certain advanced program. We find that Bob has all prerequisites to specialize in Systems and Applications, while Chris is only allowed to specialize in Applications.

Suppose the tables course and enrollment are defined as before and that the layout of the set-valued attribute courses is unnested internal for both tables, as sketched in Figure 9. We have not

(a) enrollment = R(a, b):

    student_id | courses
    Alice      | {C, T}
    Bob        | {C, D, G, T}
    Chris      | {C, G, T}

(b) course = S(c, d):

    courses   | program
    {C, D, T} | Systems
    {C, G}    | Applications

(c) course ⋈_{courses⊆courses} enrollment, i.e. S ⋈_{c⊆b} R = T(a, b, c, d):

    student_id | courses      | courses   | program
    Bob        | {C, D, G, T} | {C, D, T} | Systems
    Bob        | {C, D, G, T} | {C, G}    | Applications
    Chris      | {C, G, T}    | {C, G}    | Applications

Figure 8. An example computation of the set containment join operator (⋈⊆) based on relations in non-first normal form employing a set-valued attribute.
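The join in Figure 8 can be prototyped with a naive nested-loops scan. The sketch below is our own illustration of the operator's semantics, not one of the efficient signature- or partition-based algorithms from the literature; the function name and the (set, payload) row layout are assumptions.

```python
def set_containment_join(s_rows, r_rows):
    """Naive nested-loops set containment join S JOIN(c subset-of b) R.
    s_rows: (c, d) pairs with c a frozenset; r_rows: (a, b) pairs with
    b a frozenset. Emits T(a, b, c, d) rows."""
    return [(a, b, c, d)
            for c, d in s_rows          # S(c, d): (required set, program)
            for a, b in r_rows          # R(a, b): (student, enrolled set)
            if c <= b]                  # Python's <= on sets is the subset test
```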

(a) enrollment = R(a, b):

    student_id | course_id
    Alice      | Compilers
    Alice      | Theory
    Bob        | Compilers
    Bob        | Databases
    Bob        | Graphics
    Bob        | Theory
    Chris      | Compilers
    Chris      | Graphics
    Chris      | Theory

(b) course = S(c, d):

    course_id | program
    Compilers | Systems
    Databases | Systems
    Theory    | Systems
    Compilers | Applications
    Graphics  | Applications

(c) enrollment ÷_{course_id⊇course_id} course, i.e. R ÷_{b⊇c} S = T(a, d):

    student_id | program
    Bob        | Systems
    Bob        | Applications
    Chris      | Applications

Figure 9. An example computation of the set containment division operator (÷⊇) based on relations in first normal form.

found a definition of such a result table in a nested internal representation in the literature. Since all join attributes are preserved, it is unclear how


the rows belonging to a set on the one side are combined with the tuples of another set on the other side. One possible definition for representing the matches could be to pair each row from the left side with each row of the right side, i.e., one could compute the Cartesian product between the two groups of tuples that fulfill the set containment.

Because of this problem, we devised an extension of the division operator, called set containment division (÷⊇), that returns the same rows as the set containment join but delivers only the columns of the non-join attributes. Figure 9 illustrates the behavior of set containment division based on the same input data as in Figure 8, but using a 1NF data layout.

Formally, set containment division can be expressed with the help of (basic) relational division as follows, again based on the two relations R(a, b) and S(c, d):

T(a, d) = R ÷_{b⊇c} S = ⋃_{x ∈ π_d(S)} ((R ÷ π_c(σ_{d=x}(S))) × (x))

The idea of this expression is to merge the results of several divisions. In each division, the entire dividend R is divided by those tuples of the divisor S that belong to the same group. There are as many divisions as there are distinct values of S.d. We append the value of the current group to all result tuples of each division, as specified by the Cartesian product.

We have seen that set containment join and relational division are very similar. We have demonstrated the similarity by defining an operator that operates on 1NF data like division but can process many sets on both sides of the input like the set containment join. The characteristics of the three operators discussed before are summarized in Table 4.
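The union-of-divisions definition above translates directly into an executable sketch. The function names and row layouts below are our own; this is an illustration of the operator's semantics, not an efficient implementation.

```python
def division(dividend, divisor_set):
    """Basic relational division on 1NF rows: dividend is an iterable of
    (a, b) pairs; the result is the set of a values whose b-group
    contains every value of divisor_set."""
    groups = {}
    for a, b in dividend:
        groups.setdefault(a, set()).add(b)
    return {a for a, bs in groups.items() if divisor_set <= bs}

def set_containment_division(dividend, divisor):
    """R / (b superset-of c) S as a union of basic divisions, one per
    divisor group: divisor is an iterable of (c, d) rows, grouped on d."""
    by_d = {}
    for c, d in divisor:
        by_d.setdefault(d, set()).add(c)
    # Divide the entire dividend by each group and append the group value.
    return {(a, d)
            for d, cs in by_d.items()
            for a in division(dividend, cs)}
```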

5.3. Overview of Set Containment Join Algorithms

The set containment join algorithms that have been proposed in the literature are based on signatures [10] and partitioning [22]. Enhanced approaches combining both techniques have been developed, which significantly outperform all previously known approaches [17-19]. All these algorithms assume that the data is managed by the database system in a non-1NF way, i.e., the data can be anything but unnested internal, which is the layout assumed for the set containment division and basic division problems. However, in [19] the new approaches are compared to SQL-based approaches based on counting the number of elements in the join result of both sets and comparing it to the set cardinality of the candidate subset. Such a comparison is incomplete because other SQL-based approaches using NOT EXISTS (as for division) have not been taken into consideration, as described in Section 3.2.

A recent study compared set containment joins based on a nested internal and an unnested internal set representation [11], also based only on the counting approach. In particular, in the nested approach, a user-defined containment test predicate is employed that takes two set-valued attributes as parameters. Given current database technology for evaluating user-defined predicates, the commercial system in use is forced to apply the test predicate to the result table of a Cartesian product of both input tables. By rewriting the query into one using an unnested layout, a table function is employed that unnests the set-valued attribute into a table. The optimizer of the system used in their experiments decided to first build an intermediate result table that comprises the set ID and the element value as attributes, sorted on the element values. Then, the query execution plan merge-joins the two sorted input streams on the element value attributes. After that, the sorted data is grouped on the set IDs and set cardinalities. Finally, a filter condition appends only those set ID pairs to the result where the cardinality of the contained set is equal to the number of matches for this pair of sets. The experiments of this study have shown that the effort of unnesting the sets and preprocessing the data by sorting it on the attributes to be matched can greatly improve on the straightforward nested-loops approach. Unfortunately, the results have not been compared to more sophisticated approaches such as the ones proposed, for example, in [17].

The research results on relational division should be applied to set containment in future


                             | Division      | Set Containment Division | Set Containment Join
-----------------------------|---------------|--------------------------|----------------------
Operator and input relations | R(a,b) ÷ S(c) | R(a,b) ÷_{b⊇c} S(c,d)    | S(c,d) ⋈_{c⊆b} R(a,b)
Left input / dividend        | many groups   | many groups              | many sets
Right input / divisor        | single group  | many groups              | many sets
Result / quotient attributes | T(a)          | T(a,d)                   | T(a,b,c,d)
Data layout                  | 1NF           | 1NF                      | non-1NF

Table 4. Summary of operator characteristics.

work, since the division problem can be considered a sub-problem of the set containment join under the assumption that the sets are stored using an unnested internal representation. The main difference between the two operations is that division is applied to a single divisor set, whereas the set containment join compares possibly multiple sets from both sides of the join with each other. To the best of our knowledge, the strong commonality between relational division and the set containment join has not been identified and investigated before.

6. New Applications for Universal Quantification

In this section, we first argue why business intelligence problems are likely to benefit from being expressed in SQL. Then, we suggest a novel approach to compute frequent itemsets, a popular data mining task, which employs universal quantification in its SQL queries.

6.1. Database Mining

In business intelligence applications, several data mining and OLAP techniques are employed to extract novel and useful information from huge corporate data sets. Typically, the data sets are managed by a data warehouse that is based on relational database technology. Although the term data mining and, even more so, knowledge discovery in databases (KDD) suggest that the algorithms explore databases, most commercial tools merely process flat files. If they do access a database system, then database tables are used as a container to read and write data, similar to a file system. The query optimization and processing facilities of current database systems are hardly ever exploited by current data mining tools. The reasons for this certainly include:

- Portability: A data mining application that does not rely on a query language can be deployed more easily because no assumptions about the language's functionality have to be made.

- Performance: A highly tuned black-box algorithm with in-memory data structures will always be able to outperform any query processor that employs a combination of generic algorithms.

- Secrecy: A tool vendor does not want to reveal application logic. By employing SQL-based algorithms, the database administrator would be able to see these queries.

Despite these arguments against SQL-based data mining algorithms, exploiting the power of the query language for expressing data mining (sub)problems can solve several important problems:

- Data currency: The latest updates applied to the data warehouse are reflected in the query result. No (replicated) data copies have to be maintained.

- Scalability: If extremely large data sets are to be mined, then it is much easier to design a scalable SQL-based algorithm than to design an algorithm that has to manage data in external files. Storage management is one of the key strengths of a database system.

- Adaptability to data: A database optimizer tries to find the best possible execution


strategy based on the current data characteristics for a given query. Of course, in some situations this will not help. Similar to choosing a different proprietary algorithm for certain data characteristics, it may be better to employ a different query.

The latter three arguments motivated our research on SQL-based algorithms for several data mining methods. One of these methods is discussed in the following section.

6.2. Frequent Itemset Discovery with SQL

In this section, we first briefly introduce the frequent itemset discovery problem and explain the relationship between frequent itemset discovery, relational division, and the set containment join. Then, we present a new approach for this problem, which makes use of universal quantification.

6.2.1. The Frequent Itemset Discovery Problem

The computation of frequent itemsets is a computationally expensive preprocessing step for association rule discovery, which finds rules in large transactional data sets [1]. Frequent itemsets are combinations of items that appear frequently together in a given set of transactions. Association rules characterize, for example, the purchasing patterns of retail customers or the click patterns of web site visitors. Such information can be used to improve marketing campaigns, retail store layouts, or the design of a web site's contents and hyperlink structure.

Given a set of transactions, the frequent itemset discovery problem is to find itemsets within the transactions that appear at least as frequently as a given threshold, called the minimum support. For example, a user can define that an itemset is frequent if it appears in at least 2% of all transactions.

Almost all frequent itemset discovery algorithms consist of a sequence of steps that proceed in a bottom-up manner: the result of the k-th step is the set of frequent k-itemsets, denoted as Fk. The first step computes the set of frequent items (1-itemsets). Each following step consists of two phases:

1. The candidate generation phase computes a set of potentially frequent k-itemsets from Fk-1. The new set is called Ck, the set of candidate k-itemsets. It is a superset of Fk.

2. The support counting phase retains those itemsets from Ck that appear in the given set of transactions at least as frequently as the minimum support and stores them in Fk.
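The two-phase loop can be sketched in memory as follows. This is our own minimal illustration of the bottom-up scheme, not the paper's algorithm: it omits the usual pruning of candidates whose (k-1)-subsets are infrequent, all names are hypothetical, and min_support is taken as an absolute transaction count.

```python
def frequent_itemsets(transactions, min_support):
    """Bottom-up frequent itemset discovery: transactions maps a tid to
    its set of items; min_support is an absolute count."""
    # Step 1: frequent 1-itemsets.
    counts = {}
    for items in transactions.values():
        for item in items:
            counts[item] = counts.get(item, 0) + 1
    fk = {frozenset([item]) for item, c in counts.items() if c >= min_support}
    result, k = set(fk), 2
    while fk:
        # Candidate generation: unite pairs of frequent (k-1)-itemsets that
        # differ in exactly one item; Ck is a superset of Fk.
        ck = {a | b for a in fk for b in fk if len(a | b) == k}
        # Support counting: a transaction supports a candidate iff it
        # contains ALL of its items -- the universal quantification.
        fk = {c for c in ck
              if sum(1 for t in transactions.values() if c <= t) >= min_support}
        result |= fk
        k += 1
    return result
```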

The key problem of frequent itemset discovery is: "How many transactions contain a certain given itemset?" This question can be answered in relational algebra using the division operator. Suppose that we have a relation Transaction(tid, item) containing a set of transactions and a relation Itemset(item) containing a single itemset, each row containing one item. We want to collect those tid values in a relation Contains(tid) where, for all tuples in Itemset, there is a corresponding tuple in Transaction that has a matching item value together with that tid. In relational algebra, this problem can be stated as

Transaction(tid, item) ÷ Itemset(item) = Contains(tid).

The example in Figure 10 illustrates the division operation. The Transaction table consists of three transactions, and two of them contain all items of Itemset. We simply have to count the values in Contains to decide if the itemset is frequent. For example, if the minimum support is set to 60%, then the given itemset is considered frequent because its support is 2/3, which is greater than 60%. Using division terminology, Transaction plays the role of the dividend, Itemset represents the divisor, and Contains is the quotient.

Unfortunately, frequent itemset discovery poses

the additional problem that we have to check many (candidate) itemsets for frequency, i.e., unlike in Figure 10(b), we usually do not have a constant divisor relation but need many divisor relations. However, we can employ efficient algorithms for this problem. We could arrange the itemsets in a table Itemset(itemset, item)


(a) Dividend: Transaction

    tid  | item
    1001 | diapers
    1001 | beer
    1001 | chips
    1002 | chips
    1002 | diapers
    1003 | beer
    1003 | avocados
    1003 | chips
    1003 | diapers

(b) Divisor: Itemset

    item
    chips
    beer
    diapers

(c) Quotient: Contains

    tid
    1001
    1003

Figure 10. Relationship between the frequent itemset discovery problem and relational division: Transaction ÷ Itemset = Contains.
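The division of Figure 10 can be checked with the NOT EXISTS formulation mentioned in Section 3.2, here run on SQLite. The query below is a standard double-negation encoding written by us, not taken verbatim from the paper, and the dividend table is renamed Transactions to avoid the SQL keyword TRANSACTION.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Transactions(tid INTEGER, item TEXT);
CREATE TABLE Itemset(item TEXT);
""")
con.executemany("INSERT INTO Transactions VALUES (?, ?)",
    [(1001, "diapers"), (1001, "beer"), (1001, "chips"),
     (1002, "chips"), (1002, "diapers"),
     (1003, "beer"), (1003, "avocados"), (1003, "chips"),
     (1003, "diapers")])
con.executemany("INSERT INTO Itemset VALUES (?)",
                [("chips",), ("beer",), ("diapers",)])

# Division by double negation: a tid qualifies if there is no Itemset
# row without a matching Transactions row for that tid.
contains = con.execute("""
    SELECT DISTINCT t1.tid
    FROM Transactions AS t1
    WHERE NOT EXISTS (
        SELECT * FROM Itemset AS i
        WHERE NOT EXISTS (
            SELECT * FROM Transactions AS t2
            WHERE t2.tid = t1.tid AND t2.item = i.item))
    ORDER BY t1.tid
""").fetchall()
```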

and apply the division operation to each itemset group separately. As shown in Section 5.2, this problem can also be expressed by set containment division:

Itemset(itemset, item) ÷_{item⊆item} Transaction(tid, item) = Contains(itemset, tid)

Another approach is to use the standard set containment join, which requires switching from the 1NF data representation to a non-1NF representation that uses set-valued attributes. We would have to preprocess the tables by transforming the item values of each group, defined by the itemset and tid values, respectively, into a set. Instead of the above tables in 1NF, the non-1NF tables would have a schema like Itemset(itemset, itemvalues) and Transaction(tid, itemvalues), each having a set-valued attribute itemvalues.

6.2.2. Support Counting in SQL

In this paper, we focus on the support counting phase of frequent itemset discovery. For typical data sets, this phase is much more computationally expensive than the candidate generation phase.

SELECT c.itemset, COUNT(*) AS support
FROM Ck AS c, T AS t1, T AS t2, ..., T AS tk
WHERE c.item1 = t1.item AND
      c.item2 = t2.item AND
      ...
      c.itemk = tk.item AND
      t1.tid = t2.tid AND
      t1.tid = t3.tid AND
      ...
      t1.tid = tk.tid
GROUP BY c.itemset
HAVING support >= @minimum_support

Figure 11. Support counting phase according to the K-Way-Join algorithm.
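Instantiating the K-Way-Join template for k = 2 gives a query that can be run directly; below we do so on SQLite. The sample candidate table C2, its contents, and the absolute minimum support of 2 are our own illustration over the transactions of Figure 10, not data from the paper.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE C2(itemset INTEGER, item1 TEXT, item2 TEXT);
CREATE TABLE T(tid INTEGER, item TEXT);
""")
con.executemany("INSERT INTO C2 VALUES (?, ?, ?)",
                [(1, "chips", "beer"), (2, "chips", "avocados")])
con.executemany("INSERT INTO T VALUES (?, ?)",
    [(1001, "diapers"), (1001, "beer"), (1001, "chips"),
     (1002, "chips"), (1002, "diapers"),
     (1003, "beer"), (1003, "avocados"), (1003, "chips"),
     (1003, "diapers")])

# K-Way-Join instantiated for k = 2: one self-join of T per candidate item.
rows = con.execute("""
    SELECT c.itemset, COUNT(*) AS support
    FROM C2 AS c, T AS t1, T AS t2
    WHERE c.item1 = t1.item
      AND c.item2 = t2.item
      AND t1.tid = t2.tid
    GROUP BY c.itemset
    HAVING COUNT(*) >= 2
""").fetchall()
```

Only the candidate {chips, beer} reaches the minimum support of 2; {chips, avocados} occurs in a single transaction and is filtered out by the HAVING clause.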

SELECT itemset, COUNT(DISTINCT tid) AS support
FROM (
    SELECT c1.itemset, t1.tid
    FROM Ck AS c1, T AS t1
    WHERE NOT EXISTS (
        SELECT *
        FROM Ck AS c2
        WHERE NOT EXISTS (
            SELECT *
            FROM T AS t2
            WHERE NOT (c1.itemset = c2.itemset) OR
                  (t2.tid = t1.tid AND
                   t2.item = c2.item)))
) AS Contains
GROUP BY itemset
HAVING support >= @minimum_support

Figure 12. Support counting phase according to the Quiver algorithm.
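The query of Figure 12 runs essentially unchanged on SQLite. In the sketch below we populate a vertical candidate table Ck with two hypothetical 2-itemsets over the Figure 10 transactions, using an absolute minimum support of 2; the position column of the full Quiver layout is omitted because the query does not reference it. The test data are our own, not from the paper.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Ck(itemset INTEGER, item TEXT);
CREATE TABLE T(tid INTEGER, item TEXT);
""")
con.executemany("INSERT INTO Ck VALUES (?, ?)",
                [(1, "chips"), (1, "beer"), (2, "chips"), (2, "avocados")])
con.executemany("INSERT INTO T VALUES (?, ?)",
    [(1001, "diapers"), (1001, "beer"), (1001, "chips"),
     (1002, "chips"), (1002, "diapers"),
     (1003, "beer"), (1003, "avocados"), (1003, "chips"),
     (1003, "diapers")])

# Doubly nested NOT EXISTS: a (candidate, tid) pair survives iff no item
# of the candidate's itemset is missing from that transaction.
rows = con.execute("""
    SELECT itemset, COUNT(DISTINCT tid) AS support
    FROM (SELECT c1.itemset AS itemset, t1.tid AS tid
          FROM Ck AS c1, T AS t1
          WHERE NOT EXISTS (
              SELECT * FROM Ck AS c2
              WHERE NOT EXISTS (
                  SELECT * FROM T AS t2
                  WHERE NOT (c1.itemset = c2.itemset)
                     OR (t2.tid = t1.tid AND t2.item = c2.item)))) AS Contains
    GROUP BY itemset
    HAVING COUNT(DISTINCT tid) >= 2
""").fetchall()
```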


There are several approaches to express the support counting phase in SQL. Most of them are based on SQL-92. The SETM algorithm is the first SQL-based approach described in the literature [12]. Several researchers have suggested improvements of SETM. It has been shown that SETM does not perform well on large data sets, and new approaches have been devised, for example Three-Way-Join, Subquery, and Two-Group-Bys [25]. The algorithms presented in that paper perform differently for different data characteristics. Subquery is reported to be the best approach overall among the approaches based on SQL-92. The reason is that it exploits common prefixes between candidate k-itemsets when counting the support.

More recently, an approach called Set-Oriented Apriori has been proposed [27]. The authors argue that too many redundant computations are involved in each support counting phase. Their performance results have shown that Set-Oriented Apriori performs better than Subquery, especially for high values of k.

We contrast our novel approach with previous approaches based on SQL-92 where the data is stored in 1NF, i.e., we do not investigate set-valued attributes, for example. One of the approaches based on SQL-92 is K-Way-Join [25], illustrated in Figure 11. The K-Way-Join approach uses k instances of the transaction table, joining it k times with itself and with a single instance of Ck. Like all other known approaches based on SQL-92 that use a 1NF representation of itemsets, K-Way-Join assumes that each frequent and candidate k-itemset is stored in a single row: (itemset, item1, ..., itemk). However, the given transactions are stored as multiple rows using the schema (tid, item). As we will show in the following section, our novel approach uses a data layout where itemsets are stored as multiple rows, the same as the transactions.

6.2.3. Support Counting and Universal Quantification

Based on the idea of using division to specify the itemset containment problem, we devised a complete algorithm, called Quiver (QUantified Itemset discovery using a VERtical table layout) [23], that employs SQL queries containing universal quantification for both phases of the discovery task. The reasons for devising a new approach are twofold:

1. We want to formulate intuitive queries that naturally express the universal quantification problem: "Count the number of transactions where, for each transaction, all items of a given itemset are contained in the transaction." Previous approaches to SQL-based frequent itemset discovery mostly use "hardwired" queries, i.e., the quantification is circumvented by using many join conditions between individual items of candidates and transactions (as shown for the K-Way-Join approach in the previous section).

2. We want to employ a flexible itemset representation that is similar to the way transactions are stored in a database: Transaction(tid, item). In all previous approaches that use a 1NF representation, k-itemsets are stored as a single row: (itemset, item1, ..., itemk). Instead of this "horizontal" layout, Quiver uses a "vertical" layout, where a k-itemset is represented as k rows of the three-column table (itemset, position, item). One benefit of this vertical layout is its ability to store even very large itemsets, because in commercial database systems the maximum number of columns in a table is significantly lower than the maximum number of rows.

In the following, we describe only the support counting phase of Quiver, focusing on the core problem, universal quantification. The entire approach, including the candidate generation phase using universal quantification, is described in detail in [23].

The query for support counting is first presented with the help of tuple relational calculus, since the calculus offers a universal quantifier to conveniently express the quantification. After this, we show how to derive an equivalent SQL query. As explained in Section 3.2, SQL does not offer a universal quantifier; therefore, the query is expressed


with the help of negated existential quantifiers.

Since Quiver follows the classical iterative two-phase approach, suppose that we have computed the set of candidate k-itemsets Ck(itemset, position, item) based on the set of frequent (k-1)-itemsets Fk-1(itemset, position, item) during the first phase of the k-th iteration, with k ≥ 2. The set of transactions is given by the table T(tid, item).

We express the query Q in tuple relational calculus to derive combinations of transactions and candidates as

Q = { (c1 ∈ C, t1 ∈ T) | Contains }.

The query can be applied to candidate itemsets of any size; therefore, the parameter k of the particular candidate set Ck is omitted for brevity. The Contains expression of this query is defined as

Contains = ∀c2 ∈ C ∃t2 ∈ T :
    (c2.itemset = c1.itemset) → (t2.tid = t1.tid ∧ t2.item = c2.item).

The expression has two free tuple variables, c1 and t1, where c1 represents a candidate itemset and t1 a transaction that contains c1. The quantified (bound) tuple variables c2 and t2 represent the items belonging to c1 and t1, respectively. The universal quantification lies in the condition that for each item c2 belonging to itemset c1, there must be an item t2 belonging to transaction t1 that matches c2.

A combination (c1, t1) fulfilling the calculus query Q indicates that the itemset c1.itemset is contained in the transaction t1.tid. We can find the support of each candidate by counting the number of distinct values t1.tid that appear in a combination with c1.itemset. We do not show the actual counting because the basic calculus does not include aggregate functions.

Since we are interested in an SQL representation of the given calculus query, we translate it into SQL in a straightforward manner by applying the following transformations:

- Quantifiers: As already explained, there is no universal quantifier available in SQL. Therefore, we translate ∀x ∈ R : f(x) ≡ ¬∃x ∈ R : ¬f(x) into "NOT EXISTS (SELECT * FROM R AS x WHERE NOT f(x))".

- Implications: We replace an implication by a disjunction, i.e., we transform f → g ≡ ¬f ∨ g into "NOT f OR g".

- Negations: We use De Morgan's rules ¬(f ∧ g) ≡ ¬f ∨ ¬g and ¬(f ∨ g) ≡ ¬f ∧ ¬g to push a negation into a conjunction or a disjunction.

The resulting SQL query for support counting, shown in Figure 12, contains two nested NOT EXISTS expressions, analogous to the example SQL query used to express the student enrollment problem in Section 3.2. Note that the query in Figure 12 has to apply the aggregation to the set of unique transaction IDs because duplicates can occur as a result of the query processing.

To conclude this section, we point out that the

Quiver approach shows how an important data mining task can be expressed in a natural way using universal quantification. If a database system were able to recognize the quantification problem inside queries like the one in Figure 12, it could employ the most efficient algorithm that realizes the division operator, the set containment division operator, or the set containment join operator (discussed in Section 5), taking into account the current data characteristics, as explained in the previous sections. This is especially important if the data mining problem is part of a larger, more complex query involving several additional predicates. For example, consider a supermarket scenario where we restrict our analysis to transactions of the years 1999-2001 and are only interested in items of the product category "soft drinks". Such additional predicates can significantly influence the choice of the most efficient algorithm for the quantification problem.

7. Related Work

Quantifiers in queries can be expressed in relational algebra. Due to the lack of efficient division algorithms in the past, early work has recommended avoiding the relational division operator to express universal quantification in queries [3]. Instead, universal quantification is expressed with the help of the well-known anti-semi-join operator, or complement-join, as it is called in that paper.

Other early work suggests approaches other than division to process (universal) quantification [5,6]. Universal quantification is expressed by new algebra operators and is optimized based on query graphs in a non-relational data model [6]. Due to the lack of a performance analysis, we cannot comment on the efficiency of this approach.

The research literature provides only a few surveys of division algorithms [4,7,8]. Some of the algorithms reviewed in this paper have been compared both analytically and experimentally [8]. The conclusion there is that hash-division outperforms all other approaches. Complementing this work, we have shown that an optimizer has to take the input data characteristics and the set of available algorithms into account to pick the best division algorithm. The classification of four division algorithms in [8] is based on a two-by-two matrix. One axis of the matrix distinguishes between algorithms based on sorting and algorithms based on hashing. The other axis separates "direct" algorithms, which allow processing the (larger) dividend table only once, from "indirect" algorithms, which require duplicate removal (by employing semi-join) and aggregation. For example, the merge-sort division algorithm of Section 3.3.2 falls into the category "direct algorithm based on sorting", while the hash-division for divisor groups algorithm of Section 3.4.3 belongs to the combination "indirect algorithm based on hashing". Our classification refines these four approaches and focuses on the fact that data properties should be exploited as much as possible by employing "slim" algorithms that are separated from preprocessing algorithms like grouping and sorting.

Based on a classification of queries that contain universal quantification, several query evaluation techniques have been analyzed [4]. The input data of this algorithm analysis is stored in an object-oriented or object-relational database,

where set-valued attributes are available. Hence, the algorithms they examine can presuppose that the input data is grouped on certain attributes. For example, the table enrollment in Figure 1 could be represented by a set-valued enrolled_courses attribute of a student class. The authors conclude that universal quantification based on anti-semi-join is superior to all other approaches, similar to the conclusion of [3]. Note, however, that this paper has a broader definition of queries involving universal quantification than the classic definition that involves the division operator. Moreover, the anti-semi-join approach requires considerable overhead for preprocessing the dividend. An equivalent definition of the division operator using anti-semi-join (▷) as well as semi-join (⋉) and left outer join (⟕) is: S ÷ T = ((S ⋉ T) ⟕ T) ▷ T.

In this paper, we focused on the universal (for-all) quantifier. Generalized quantifiers have been proposed to specify quantifiers like "at least ten" or "exactly as many" in SQL [13]. Such quantifiers can be processed by algorithms that employ multi-dimensional matrix data structures [24]. In that paper, however, the implementation of an operator called all is presented that is similar to but different from relational division. Unlike division, the result of the all operator contains some attributes of the divisor. Hence, we have to employ a projection on the quotient attributes of the all operator's result to obtain a valid quotient.

Transformation rules for optimizing queries

containing multiple (existential and universal)quanti�cations are presented in [15]. Our contri-bution complements this work by o�ering strate-gies to choose a single (division) operator, whichmay be one element of a larger query processingproblem.
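The anti-semi-join view of division discussed above can be made concrete with a small, self-contained sketch: a candidate quotient survives exactly when no required (quotient, divisor) pair is missing from the dividend. The following Python fragment is our own illustration (relations modeled as plain sets of tuples; the helper name `division` is ours), not the operator tree used in [4]:

```python
def division(s, t):
    """Relational division: s is a set of (quotient, divisor) pairs,
    t is a set of divisor values. Returns the quotients that are
    paired with every divisor in t."""
    candidates = {q for (q, _) in s}
    # Anti-semi-join flavor: a candidate q is disqualified as soon as
    # some required pair (q, d) is missing from the dividend.
    missing = {q for q in candidates for d in t if (q, d) not in s}
    return candidates - missing

enrollment = {("alice", "db"), ("alice", "os"), ("bob", "db")}
courses = {"db", "os"}
print(division(enrollment, courses))  # prints {'alice'}
```

With an empty divisor set, every candidate quotient is returned, matching the usual definition of division.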

8. Conclusion and Future Work

Based on a classification of input data properties, we were able to differentiate the major currently known algorithms for relational division. In addition, we could provide new algorithms for previously unsupported data properties. Thus, for the first time, an optimizer has a full range of algorithms, separated by their input data properties and efficiency measures, to choose from.

We are aware of the fact that database system vendors are reluctant to implement several alternative algorithms for the same query operator, in our case the division operation. One reason is that the optimizer's rule set has to be extended, which can lead to a larger search space for queries containing division. Another reason is that the optimizer must be able to detect a division in a query. This is a non-trivial task because a division cannot be expressed directly in SQL:1999 [2]. No keyword similar to "FOR ALL" [9] is available, and division has to be expressed indirectly, for example by using nested "NOT EXISTS" clauses or by using the "division by counting" approach on the query language level. To the best of our knowledge, there is no database system that has an implementation of hash-division (or any of its improvements), although this efficient algorithm has been known for many years [7]. However, we believe that as soon as a dedicated keyword for universal quantification is supported by the SQL standard and its benefit is recognized and exploited by applications, many options and strategies are available today for database system vendors to implement an efficient division operator.

The similarity between relational division and the set containment join has been discussed for the first time. This may lead to more research that investigates the possibility of representing sets in an unnested storage layout, because efficient algorithms for division can be exploited. We have proposed a new operator, called set containment division, that realizes set containment joins for data in first normal form.

We have discussed an important application of the division (and hence set containment) problem, namely frequent itemset discovery. We plan to investigate the potential of using universal quantification in queries in further data mining methods of business intelligence applications.

Our future work includes the analysis of further data properties that have an influence on the optimization of division queries, like the current data distribution or the availability of certain indexes. Furthermore, we will study the potential of parallelizing division algorithms, based on the detailed studies in [8] on parallelizing hash-division and aggregate algorithms. In addition, the comparison between division and set containment join algorithms deserves more attention. In particular, further investigations of both operators need to take into account the cost of nesting and unnesting between the 1NF and non-1NF storage representations of sets in order to provide fair performance comparisons.

REFERENCES

1. R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. In Proceedings VLDB, Santiago, Chile, pages 487-499, September 1994.

2. ANSI/ISO/IEC 9075-2. Information Technology - Database Language - SQL - Part 2: Foundation (SQL/Foundation), 1999.

3. F. Bry. Towards an Efficient Evaluation of General Queries: Quantifier and Disjunction Processing Revisited. In Proceedings SIGMOD, Portland, Oregon, USA, pages 193-204, May-June 1989.

4. J. Claußen, A. Kemper, G. Moerkotte, and K. Peithner. Optimizing Queries with Universal Quantification in Object-Oriented and Object-Relational Databases. In Proceedings VLDB, Athens, Greece, pages 286-295, August 1997.

5. U. Dayal. Queries with Quantifiers: A Horticultural Approach. In Proceedings PODS, Atlanta, Georgia, USA, pages 125-136, March 1983.

6. U. Dayal. Of Nests and Trees: A Unified Approach to Processing Queries that Contain Nested Subqueries, Aggregates, and Quantifiers. In Proceedings VLDB, Brighton, England, pages 197-208, September 1987.

7. G. Graefe. Query Evaluation Techniques for Large Databases. TKDE, 25(2):73-170, June 1993.

8. G. Graefe and R. Cole. Fast Algorithms for Universal Quantification in Large Databases. TODS, 20(2):187-236, 1995.

9. P. Gulutzan and T. Pelzer. SQL-99 Complete, Really: An Example-Based Reference Manual of the New Standard. R&D Books, Lawrence, Kansas, USA, 1999.

10. S. Helmer and G. Moerkotte. Evaluation of Main Memory Join Algorithms for Joins with Set Comparison Join Predicates. In Proceedings VLDB, Athens, Greece, pages 386-395, August 1997.

11. S. Helmer and G. Moerkotte. Compiling Away Set Containment and Intersection Joins. Technical Report 04/02, University of Mannheim, Germany, April 2002.

12. M. Houtsma and A. Swami. Set-Oriented Data Mining in Relational Databases. DKE, 17(3):245-262, December 1995.

13. P. Hsu and D. Parker. Improving SQL with Generalized Quantifiers. In Proceedings ICDE, Taipei, Taiwan, pages 298-305, March 1995.

14. G. Jaeschke and H.-J. Schek. Remarks on the Algebra of Non First Normal Form Relations. In Proceedings PODS, Los Angeles, California, USA, pages 124-138, March 1982.

15. M. Jarke and J. Koch. Range Nesting: A Fast Method to Evaluate Quantified Queries. In Proceedings SIGMOD, San Jose, California, USA, pages 196-206, May 1983.

16. A. Makinouchi. A Consideration on Normal Form of Not-Necessarily-Normalized Relation in the Relational Data Model. In Proceedings VLDB, Tokyo, Japan, pages 447-453, October 1977.

17. S. Melnik and H. Garcia-Molina. Adaptive Algorithms for Set Containment Joins. Technical Report, Department of Computer Science, Stanford University, California, USA, November 2001.

18. S. Melnik and H. Garcia-Molina. Divide-and-Conquer Algorithm for Computing Set Containment Joins. In Proceedings EDBT, Prague, Czech Republic, pages 427-444, March 2002.

19. S. Melnik and H. Garcia-Molina. Divide-and-Conquer Algorithm for Computing Set Containment Joins. Extended Technical Report, Stanford University, California, USA, 2002.

20. C. Nippl, R. Rantzau, and B. Mitschang. StreamJoin: A Generic Database Approach to Support the Class of Stream-Oriented Applications. In Proceedings IDEAS, Yokohama, Japan, pages 83-91, September 2000.

21. K. Ramasamy. Efficient Storage and Query Processing of Set-Valued Attributes. PhD thesis, University of Wisconsin, Madison, Wisconsin, USA, 2002. 144 pages.

22. K. Ramasamy, J. Patel, J. Naughton, and R. Kaushik. Set Containment Joins: The Good, The Bad and The Ugly. In Proceedings VLDB, Cairo, Egypt, pages 351-362, September 2000.

23. R. Rantzau. Frequent Itemset Discovery with SQL Using Universal Quantification. In Proceedings Workshop on Database Technology for Data Mining (DTDM), Prague, Czech Republic, pages 51-66, March 2002.

24. S. Rao, A. Badia, and D. v. Gucht. Providing Better Support for a Class of Decision Support Queries. In Proceedings SIGMOD, Montreal, Canada, pages 217-227, June 1996.

25. S. Sarawagi, S. Thomas, and R. Agrawal. Integrating Association Rule Mining with Relational Database Systems: Alternatives and Implications. In Proceedings SIGMOD, Seattle, Washington, USA, pages 343-354, June 1998.

26. J. Smith and P. Chang. Optimizing the Performance of a Relational Algebra Data Base Interface. CACM, 18(10):568-579, October 1975.

27. S. Thomas and S. Chakravarthy. Performance Evaluation and Optimization of Join Queries for Association Rule Mining. In Proceedings DaWaK, Florence, Italy, pages 241-250, August-September 1999.

28. W3C. XQuery 1.0: An XML Query Language. Working Draft 7, W3C, 2001.

Appendix: Pseudo Code of Division Algorithms

The following algorithms in Figures 13-17 assume that the division's input consists of a dividend table S(quotient, divisor) and a divisor table T(divisor). Furthermore, we use the variables s and t to refer to a single row within S and T, respectively. The data structures dht and qht represent a divisor hash table and a quotient hash table, respectively.


s1 = s2 = s;
while not t.isEmpty() do
  insert t into dht;
while not s1.isEmpty() do
  if s1.quotient not in qht then begin
    while not s2.isEmpty() do
      if s1.quotient == s2.quotient and s2.divisor in dht then
        set bit of s2.divisor in dht;
    if no bit in dht is equal to zero then
      output row (s1.quotient);
    reset all bits in dht to zero;
    insert s1.quotient into qht;
  end;

Figure 13. Nested-Loops Division (Class 0)
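For readers who prefer runnable code, here is one possible Python transcription of the nested-loops division above. It is a sketch under our own naming (lists stand in for the table scans, a dict for dht, a set for qht), not production code:

```python
def nested_loops_division(s, t):
    """Nested-loops division, mirroring Figure 13: s is a list of
    (quotient, divisor) pairs, t a list of divisor values."""
    dht = {d: False for d in t}   # divisor hash table, one bit per divisor
    qht = set()                   # quotient values already processed
    result = []
    for q1, _ in s:               # outer scan of the dividend (s1)
        if q1 in qht:
            continue              # this quotient was handled before
        for q2, d2 in s:          # inner scan of the dividend (s2)
            if q1 == q2 and d2 in dht:
                dht[d2] = True    # set the bit of this divisor
        if all(dht.values()):     # no bit is zero: all divisors matched
            result.append(q1)
        for d in dht:             # reset all bits to zero
            dht[d] = False
        qht.add(q1)
    return result
```

As with the textbook definition, an empty divisor table makes every quotient qualify, since the all-bits-set test is vacuously true.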

// build the divisor hash table
divisor_count = 0;
while not t.isEmpty() do begin
  insert t into dht;
  t.divisor_number = divisor_count;
  divisor_count++;
end;
// build the quotient hash table
while not s.isEmpty() do
  if a matching divisor row t in dht is found then begin
    if no matching candidate quotient row q is found in qht then begin
      q = new quotient candidate row created from quotient attributes of
          dividend row s including a bitmap initialized with zeroes;
      insert q into qht;
    end;
    set bit in q's bitmap corresponding to t.divisor_number;
  end;
// find result in the quotient table
foreach bucket in the quotient table do
  foreach row q in bucket do
    if the associated bitmap of q contains no zero then
      output row (q);

Figure 14. Classic Hash-Division (Class 0)
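The same logic in runnable form: a possible Python sketch of classic hash-division (our naming; a dict maps each divisor to its bit position, and each quotient candidate carries a bitmap). As in Figure 14, dividend rows whose divisor does not appear in T are simply skipped:

```python
def hash_division(s, t):
    """Classic hash-division, mirroring Figure 14: s is a list of
    (quotient, divisor) pairs, t a list of divisor values."""
    dht = {}                      # divisor hash table: divisor -> bit position
    for d in t:
        if d not in dht:
            dht[d] = len(dht)     # assign divisor numbers 0, 1, 2, ...
    qht = {}                      # quotient hash table: quotient -> bitmap
    for q, d in s:                # one pass over the dividend
        if d in dht:
            bitmap = qht.setdefault(q, [0] * len(dht))
            bitmap[dht[d]] = 1    # set the bit for this divisor number
    # scan the quotient hash table for bitmaps that contain no zero
    return [q for q, bitmap in qht.items() if 0 not in bitmap]
```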


t_count = 0;
while not t.isEmpty() do begin
  t_count++;
  t.next();
end;
if not s.isEmpty() then begin
  s.next();
  current_quotient = s.quotient;
end;
while not s.isEmpty() do begin
  s_count = 0;
  while not s.isEmpty() and s.quotient == current_quotient do begin
    s_count++;
    s.next();
  end;
  if s_count == t_count then
    output row (current_quotient);
  if not s.isEmpty() then
    current_quotient = s.quotient;
end;

Figure 15. Merge-Count Division (Class 5)
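A runnable sketch of the merge-count idea (our naming): since the algorithm only compares per-group row counts against the divisor cardinality, it is correct only when the dividend is grouped on the quotient attribute, is duplicate-free, and contains only divisor values that also occur in T:

```python
def merge_count_division(s, t):
    """Merge-count division, mirroring Figure 15: s is a list of
    (quotient, divisor) pairs grouped on the quotient attribute."""
    t_count = len(t)              # count the divisor rows
    result = []
    i = 0
    while i < len(s):
        current_quotient = s[i][0]
        s_count = 0
        while i < len(s) and s[i][0] == current_quotient:
            s_count += 1          # count the rows of this quotient group
            i += 1
        if s_count == t_count:    # group covers all divisors
            result.append(current_quotient)
    return result
```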

is_first_row = true;
while not s.isEmpty() do begin
  if is_first_row and not t.isEmpty() then begin
    // this is the first time that we fetch a row from S
    s.next();
    t.next();
    is_first_row = false;
  end;
  current_quotient = s.quotient;
  while not s.isEmpty() and s.quotient == current_quotient and
        not t.isEmpty() and s.divisor <= t.divisor do begin
    while not s.isEmpty() and s.quotient == current_quotient and
          s.divisor < t.divisor do
      s.next();
    while not s.isEmpty() and s.quotient == current_quotient and
          not t.isEmpty() and s.divisor == t.divisor do begin
      s.next();
      t.next();
    end;
  end;
  if t.isEmpty() then
    // all divisor values of the divisor table have been matched
    output row (current_quotient);
  t.initialize(); // reopen the sorted divisor table
  if not t.isEmpty() then
    t.next(); // fetch the first divisor row
  while not s.isEmpty() and s.quotient == current_quotient do
    s.next();
end;

Figure 16. Merge-Sort Division (Class 10). Without loss of generality, the pseudo code assumes an ascending sort order.
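A possible Python rendering of the merge-sort division logic (our naming; list indices replace the cursor interface, and the divisor table is "reopened" per quotient group by resetting an index). Both inputs are assumed sorted ascending and duplicate-free, with the dividend grouped on the quotient attribute:

```python
def merge_sort_division(s, t):
    """Merge-sort division, mirroring Figure 16: s is a sorted list of
    (quotient, divisor) pairs, t a sorted list of divisor values."""
    result = []
    i = 0
    while i < len(s):
        current_quotient = s[i][0]
        j = 0                     # reopen the sorted divisor table
        while i < len(s) and s[i][0] == current_quotient and j < len(t):
            if s[i][1] < t[j]:
                i += 1            # extra dividend divisor, skip it
            elif s[i][1] == t[j]:
                i += 1            # matched: advance both inputs
                j += 1
            else:
                break             # t[j] is missing from this group
        if j == len(t):           # all divisor values have been matched
            result.append(current_quotient)
        while i < len(s) and s[i][0] == current_quotient:
            i += 1                # skip the rest of the group
    return result
```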


// build the divisor hash table
while not t.isEmpty() do
  insert t into dht with a new bitmap initialized with zeroes;
// build the quotient hash table
quotient_count = 0;
while not s.isEmpty() do begin
  if s.quotient is not in qht then begin
    insert (s.quotient) into qht;
    index = (s.quotient).quotient_number = quotient_count;
    quotient_count++;
  end
  else
    index = value of (s.quotient).quotient_number in qht;
  d = result of lookup of s.divisor in dht;
  d.bitmap[index] = 1;
end;
// find result in the divisor hash table
if number of rows in dht > 0 then begin
  bitmap = new bitmap initialized with ones;
  foreach bucket in the dht do
    foreach row d in bucket do
      bitmap = bitmap & d.bitmap; // bit-wise AND operation
  foreach index value in bitmap == 1 do begin
    q = quotient row in qht associated with index;
    output row (q);
  end;
end;

Figure 17. Transposed Hash-Division (Class 0)
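Finally, a runnable sketch of transposed hash-division (our naming): each divisor owns a bitmap with one bit per quotient candidate, and the result is obtained by AND-ing all divisor bitmaps. The guard against divisors missing from T is our addition; Figure 17 assumes every dividend divisor occurs in T:

```python
def transposed_hash_division(s, t):
    """Transposed hash-division, mirroring Figure 17: s is a list of
    (quotient, divisor) pairs, t a list of divisor values."""
    dht = {d: [] for d in t}      # divisor -> bitmap over quotient candidates
    qht = {}                      # quotient -> bit position (quotient_number)
    quotients = []                # quotient values, indexed by bit position
    for q, d in s:
        if q not in qht:
            qht[q] = len(quotients)
            quotients.append(q)
            for bitmap in dht.values():
                bitmap.append(0)  # grow every divisor bitmap by one bit
        if d in dht:              # our guard; see lead-in above
            dht[d][qht[q]] = 1
    if not dht:                   # empty divisor table: empty result, as in the figure
        return []
    acc = [1] * len(quotients)
    for bitmap in dht.values():   # bit-wise AND over all divisor bitmaps
        acc = [a & b for a, b in zip(acc, bitmap)]
    return [quotients[i] for i, bit in enumerate(acc) if bit]
```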