
Optimization of Large Join Queries

Arun Swami    Anoop Gupta

Computer Science Department, Stanford University, Stanford, CA 94305

Abstract

We investigate the problem of optimizing Select-Project-Join queries with large numbers of joins. Taking advantage of commonly used heuristics, the problem is reduced to that of determining the optimal join order. This is a hard combinatorial optimization problem. Some general techniques, such as iterative improvement and simulated annealing, have often proved effective in attacking a wide variety of combinatorial optimization problems. In this paper, we apply these general algorithms to the large join query optimization problem. We use the statistical techniques of factorial experiments and analysis of variance (ANOVA) to obtain reliable values for the parameters of these algorithms and to compare these algorithms. One interesting result of our experiments is that the relatively simple iterative improvement proves to be better than all the other algorithms (including the more complex simulated annealing). We also find that the general algorithms do quite well at the maximum time limit.

1 Introduction

The problem of query optimization in relational database systems has received a lot of attention ([JK84] provides a good overview). However, current query optimizers expect to process queries involving only a small number of joins (less than 10 joins). We expect that novel applications built on top of relational systems will require processing of queries with a much larger number of joins. Knowledge base systems using relational systems for storage of persistent data are examples of possible applications. Object-oriented database systems, for instance, Iris [FBC*87], which use relational systems for storage of information, are another class of potential applications generating many joins. Krishnamurthy and Boral in [KBZ86] mention applications from logic programming resulting in

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

© 1988 ACM 0-89791-268-3/88/0006/0008 $1.50

"... expressions (similar to database queries) with hundreds (if not thousands) of joins."

In current query optimization algorithms, algebraic transformations and heuristics (such as pushing selections down as much as possible) have been used. These will continue to prove useful. However, strategies for searching large solution spaces are not as well developed. For example, the System R query optimizer [SAC*79] uses a dynamic programming algorithm whose worst case time complexity is O(2^N) (with an exponential space requirement), where N is the number of joins. Use of this algorithm becomes infeasible as N increases beyond 10.

In [IW87], Ioannidis and Wong study how simulated annealing [KGV83] can be used to obtain an efficient algebraic structure for a given recursive query. This too is a problem which involves searching a large solution space. However, they do not consider the problem of optimizing non-recursive queries with a large number of joins.

For optimizing non-recursive queries, an O(N^2) heuristic search algorithm has been described in [KBZ86]. The theory on which their work is based requires that the cost functions have a certain form. To be cast in this form, the cost functions have to be oversimplified. Also, even in the simplified form, not all join methods have a cost function of the required form. Our work does not depend on any particular cost model; any reasonable cost model will do. For this particular set of experiments we use a cost model for join processing in main memory databases [Swa87a].

In this paper we investigate the problem of optimizing non-recursive large join queries. As will be shown later, the large join query optimization problem (to be denoted by LJQOP) is a hard combinatorial optimization problem. Various general techniques, i.e., techniques which are applicable to a wide variety of problems, have been developed for tackling combinatorial optimization problems. In this paper we adapt these techniques to LJQOP and compare them to determine which techniques are the most effective. We also discuss the various problems that arise in performing such a comparison. We plan to investigate heuristic approaches in future work. An example of such a heuristic is the one described in [KBZ86].

The optimization techniques investigated in this paper are general algorithms like simulated annealing and the techniques based on local search [PS82]. In particular, it is of interest to see how useful simulated annealing proves to be, since experience with this technique has not been uniformly positive [NSS86], [JAMS87]. Besides simulated annealing, techniques based on local optimization, for example, iterative improvement, have also been successfully employed in combinatorial optimization problems. Again though, none of these techniques have been explored in the context of optimizing large join queries.

When comparing different optimization techniques, it is necessary to have a definite criterion to evaluate them. The criterion we use in our study is based on the general principle that we consider a query optimizer to be good in practice if it performs well on the average, and very rarely performs poorly. We will discuss how we translate this principle into an appropriate quantitative measure of the effectiveness of a technique. We also discuss our use of comprehensive statistical methods for both tuning the techniques (obtaining reliable values for the parameters of the techniques) and for comparing the techniques.

The organization of this paper is as follows. In Section 2, we describe the problem of large join query optimization. Assumptions and commonly used simple heuristics are discussed. In Section 3, we describe the different general algorithms which are compared in this experimental work. In Section 4, we describe how these algorithms have been adapted to the specific problem of query optimization. We also describe how test queries are generated. In Section 5, we briefly describe the statistical methods used and present the results of our experiments in tuning and comparing the different algorithms. Finally, in Section 6, we draw some conclusions and describe future research directions.

2 Problem Formulation

Consider a relational algebra query with N joins (by a join, we mean a "two-way" join, i.e., a join between two relations) linking (N + 1) operand relations (some of these operand relations may represent the same base relations). In traditional database applications, N is typically between 0 and 9. A large join query is a query where N >= 10. The large join query optimization problem (LJQOP) is to find the query evaluation plan (QEP) having the lowest cost. In this paper, we experiment with queries where N is an order of magnitude larger than usual in queries in traditional applications, i.e., we consider queries where 10 <= N <= 100.

As is common in query optimization research, we restrict our work to queries involving only selections, projections, and joins (Select-Project-Join queries). We use the simple heuristics of pushing selections down as much as possible, and performing projections as soon as possible. These heuristics do not alter the combinatorial nature of the search space. The major remaining problems in query optimization are deciding on the order in which we join the relations and the join method to be used for each join operation.

In our current experiments, we decided to use only the hash join method. Simulations based on the cost model in [Swa87a] and other proposed models for query processing using large main memory [DKO*84] show that hash-based methods perform well over large ranges of values of the parameters. Also, we need to estimate lower bounds before carrying out the optimization, as shown in Section 4.2. It is a hard problem to obtain good estimates of lower bounds when other join methods are included. In future work, we will consider how to incorporate the use of multiple join methods in the optimization algorithms. Now, our optimization problem becomes one of choosing the best join order. A QEP can be concisely represented by a join processing tree, abbreviated as JT, and we wish to find a JT having the lowest cost. Note that there may be many JTs with the same cost.

Binary join processing trees (BJTs) are JTs in which each join operator has exactly two operands. A special class of BJTs is the class of linear join processing trees (LJTs). In LJTs, of the two join operands, at most one can be an intermediate relation. Most join methods distinguish between the two operands, one being the "outer" relation and the other the "inner" relation. An outer linear join processing tree (OLJT) is an LJT in which the inner relation is always a base relation, never an intermediate relation.

The number of different OLJTs is (N + 1)!, whereas the number of different BJTs is (2N choose N) * N!. In our experiments, we generate only OLJTs. Most query optimizers, including System R, use the same restriction to cut down on the search space. This is done based on the assumption that a significant portion of the JTs with low processing cost is to be found in the space of OLJTs. The validation of this assumption is an open problem. Another advantage of using LJTs is that they provide increased opportunities for pipelining. Finally, OLJTs, unlike LJTs where the intermediate relations are the inner relations, permit one to take advantage of indexes on join columns of base relations. This is particularly important in nested join methods which prefer join indexes on the inner relations.

Query optimizers restrict the space further by postponing cross-products as late as possible (the intuition is that cross-products are expensive and result in large intermediate results). However, this does not change the combinatorial nature of the search space. In fact, in [IK84], it has been shown that a special case of LJQOP is an NP-complete problem. This shows that the large join query optimization problem is a hard combinatorial optimization problem.

The query optimizer is given a join graph representing the join predicates linking the different relations in the query. If the query cannot be evaluated without using a cross-product, the join graph will have at least two components. Using the heuristic of postponing cross-products as late as possible, all such components are processed separately and then the results are joined using cross-products. This means we need concern ourselves only with the optimization of a single component. It is clear that there exists at least one JT for processing a component which does not require a cross-product. Such a JT is called a valid JT. We restrict our search to the space of all valid JTs.

In the next section we introduce some terminology from combinatorial optimization, and then describe different algorithms for tackling such problems. These algorithms, adapted to our query optimization problem, will be compared in Section 5.

3 Combinatorial Optimization Techniques

Each solution to a combinatorial optimization problem can be looked upon as a state in a state space (in our case the JTs are the states). Each state has a cost associated with it as given by some cost function. A move is a perturbation applied to a solution to get another solution; one moves from the state represented by the former solution to the state represented by the latter solution. (We will describe the moves we use in Section 4.1.) A move set is the set of moves available to go from one state to another. Any one move is chosen from this move set at random. The probability associated with selecting any particular move is specified along with the move set.

Two states are said to be adjacent states (or neighbouring states) if one move suffices to go from one state to the other. A local minimum in the state space is a state such that its cost is lower than that of all neighbouring states. There can be many such local minima. A global minimum is a state which has the lowest cost among all the local minima. There can be more than one global minimum. The optimal solution is a global minimum. A move which takes one to a state with a lower cost is called a downward move; otherwise, it is called an upward move. In a local or global minimum, all moves are upward moves.

Using this terminology, we now describe the combinatorial optimization techniques that we will compare in Section 5. In all the techniques described below, selection among adjacent states is done at random. Also, the initial state is the same for all algorithms, and it may be given or it may be generated at random. The initial state must be distinguished from the start state, which is the state in which a new local optimization run begins, as discussed below.

Perturbation Walk (PW): This is the simplest technique. The starting state is a randomly generated state. One keeps moving from the current state to an adjacent state, remembering the lowest cost state visited. The moves are chosen at random. When the algorithm terminates (the stopping criterion could be a time limit; another stopping criterion is discussed in Section 4.2), the lowest cost state encountered is output as the best solution found.
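As a concrete illustration, here is a minimal runnable sketch of PW in C on a toy one-dimensional state space. The toy cost() and randomMove() are assumptions standing in for the join cost model and the moves of Section 4.1, and the fixed step count stands in for the time limit; this is not the authors' implementation.

#include <stdio.h>
#include <stdlib.h>

/* Toy stand-ins: the real states are valid join trees, and the real
 * cost comes from the join processing cost model. */
static double cost(int s) { return (double)(s - 42) * (s - 42); }
static int randomMove(int s) { return s + (rand() % 3) - 1; }  /* adjacent state */

int main(void)
{
    int s = rand() % 100;    /* randomly generated starting state */
    int best = s;            /* lowest cost state visited so far */
    for (int step = 0; step < 10000; step++) {  /* stand-in for the time limit */
        s = randomMove(s);                      /* move, accepted unconditionally */
        if (cost(s) < cost(best))
            best = s;
    }
    printf("best state %d, cost %.0f\n", best, cost(best));
    return 0;
}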

Quasi-random Sampling (QS): This technique is somewhat similar to PW. Instead of moving to an adjacent state, one generates a new "random" solution. Again, when the algorithm terminates, the lowest cost solution generated is output as the best solution found. The reason for the qualification "quasi" is that we do not generate truly random solutions. Random sampling in the space of all JTs is simple (one generates random permutations). However, truly random sampling in the space of valid JTs is a hard open problem.
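A minimal runnable sketch of QS, with a toy cost() assumed in place of the real cost model: each trial draws a fresh permutation with a Fisher-Yates shuffle and the cheapest one seen is kept. Note that it samples unconstrained permutations, whereas the optimizer needs valid ones (the hard problem just noted).

#include <stdio.h>
#include <stdlib.h>

#define NREL 11                      /* N = 10 joins -> N + 1 relations */

static double cost(const int *p)     /* placeholder for the real cost model */
{
    double c = 0.0;
    for (int i = 0; i < NREL; i++)
        c += (double)p[i] * i;
    return c;
}

static void randomPerm(int *p)       /* Fisher-Yates shuffle */
{
    for (int i = 0; i < NREL; i++)
        p[i] = i;
    for (int i = NREL - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        int t = p[i]; p[i] = p[j]; p[j] = t;
    }
}

int main(void)
{
    int p[NREL];
    double bestCost = 1e300;
    for (int trial = 0; trial < 10000; trial++) {  /* stand-in for the time limit */
        randomPerm(p);                             /* new "random" solution */
        if (cost(p) < bestCost)
            bestCost = cost(p);                    /* remember the cheapest */
    }
    printf("best cost found: %.0f\n", bestCost);
    return 0;
}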

/* Get an initial solution */
S = initialize();
/* Current minimum cost solution */
minS = S;

repeat {
    repeat {
        /* Randomly selected adjacent state */
        newS = move(S);
        if (cost(newS) < cost(S))
            S = newS;
    } until ("local minimum reached");

    if (cost(S) < cost(minS))
        minS = S;

    /* Obtain a new starting state */
    S = newStart();
} until ("stopping condition satisfied");

return (minS);

Figure 1: Local Optimization


Local Optimization: This technique has a number of variations; it is also referred to as local search [PS82]. A move is accepted if the adjacent state being moved to is of lower cost than the current state. If this is done repeatedly, the optimizer attains a local minimum (which is not necessarily a global minimum); see Figure 1. The sequence of moves to a local minimum from the start state is termed a run.

Two questions need to be answered before the algorithm in Figure 1 can be considered completely specified. (We deferred the discussion of stopping criteria to Section 4.2.)

• How is the start state obtained? The start state is the state in which we begin a run of local optimization. For the first run, the start state is the initial state. After a local minimum is reached, two variations on how to generate a new start state have been proposed [NSS86]. In iterative improvement (II), the start state is a state generated at random using, say, the same state generator as in QS. In sequence heuristic (SH), the start state is obtained by making a number of moves from the local minimum, except that this time each move is accepted irrespective of whether it increases or decreases the cost.

• How is a local minimum detected? A state usually has a large number of neighbours. It would be impractical to exhaustively enumerate all the neighbours to verify that it is a local minimum. Instead, an approximation based on random sampling is used. We generate and test a large number of adjacent states. If any one is of lower cost, we move to that state and start all over again. If no tested neighbour is of lower cost, the current state is taken to be a local minimum. The number of states tested before a local minimum is declared is a parameter of both II and SH, and is called the sequence length (denoted by seqLength).
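A minimal sketch of this sampling test, with toy cost() and move() functions assumed in place of the cost model and the moves of Section 4.1:

#include <stdio.h>
#include <stdlib.h>

static double cost(int s) { return (double)((s % 97 + 97) % 97); }  /* toy cost */
static int move(int s) { return s + (rand() % 21) - 10; }           /* toy neighbour */

/* Try up to seqLength random neighbours; accept the first downward
 * move. If none is found, declare the current state a local minimum. */
static int localStep(int *s, int seqLength)
{
    for (int tries = 0; tries < seqLength; tries++) {
        int candidate = move(*s);
        if (cost(candidate) < cost(*s)) {
            *s = candidate;   /* downward move: not a local minimum yet */
            return 1;
        }
    }
    return 0;                 /* treat the current state as a local minimum */
}

int main(void)
{
    int s = rand() % 1000;
    while (localStep(&s, 20))   /* one run of local optimization */
        ;
    printf("local minimum %d, cost %.0f\n", s, cost(s));
    return 0;
}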


/* Get an initial solution */
S = initialize();
/* Current minimum cost solution */
minS = S;
/* Set the initial temperature */
T = initialTemp();

repeat {
    repeat {
        /* Randomly selected adjacent state */
        newS = move(S);
        delta = cost(newS) - cost(S);
        if (delta <= 0)
            S = newS;
        if (delta > 0)
            S = newS "with probability exp(-delta/T)";
        if (cost(S) < cost(minS))
            minS = S;
    } until ("inner-loop criterion is satisfied");

    /* Reduce the temperature */
    T = reduceTemp(T);
} until ("system has frozen");

return (minS);

Figure 2: Simulated Annealing


Simulated Annealing (SA): This technique can also be viewed as a variation on local optimization, but it differs from II and SH in significant ways. The simulated annealing algorithm was originally derived by analogy to the process of annealing of crystals from liquid solution. Hence, the terminology of physical processes is commonly used, e.g., temperature, freezing condition, though these physical analogies can be dispensed with (and, indeed, often are) once the algorithm and its parameters have been described. The general algorithm is shown in Figure 2.

As in other local optimization techniques, moves which decrease the cost are always accepted. However, in SA, moves which increase the cost can also be accepted at any time. Such moves are accepted with a probability which depends on the increase in cost entailed by making the move (delta in the above algorithm), and a parameter called the temperature (T). The exponential form of the probability function is derived from the mathematical model of the annealing process. Looking at the exponential probability function in Figure 2, it is clear that the higher the temperature, the more likely that an upward move will be accepted. Hence, the higher the temperature, the higher the fraction of moves which are accepted. Also, the probability increases with decreasing delta. This means that upward moves to a state of much higher cost are less likely to be accepted.

As before, a number of details need to be filled in in the algorithm in Figure 2. A number of variations are possible, and we will describe two important variations which we have used in our experiments. The first variation (described in [JAMS87]) is denoted by SAJ, and the second variation (described in [HRS86]) is denoted by SAH. Some of the details of these variations are omitted for brevity; the reader is referred to the cited references for a complete description.

• How is the initial temperature (denoted by T0) determined? In SAJ, T0 is taken to be that temperature at which a significant fraction (initProb) of all attempted moves is accepted. A typical value for initProb is 0.4. To obtain T0, one starts with the temperature being set to the cost of the initial state, and keeps doubling the temperature until a value is obtained at which an initProb fraction of moves is accepted. In SAH, T0 is taken to be K * σ, where σ is the standard deviation of the cost distribution (estimated by some initial sampling). A reasonably high value for K is 20.

• How does reduceTemp work? In SAJ, the current temperature T is multiplied by a fixed reduction factor called tempFactor. A typical value for tempFactor is 0.95. In SAH, the reduction factor is given by the value of the function max(0.5, exp(-λT/σ)), where a typical value for λ is 0.7. The lower bound of 0.5 is used to prevent precipitous annealing at high temperatures. (A sketch of both temperature schedules appears after this list.)

• What is the inner-loop criterion? The inner-loop criterion determines the number of times that the inner loop is executed. This number is called the chain length. In SAJ, the chain length is given by sizeFactor * N, where sizeFactor is a parameter. Here the chain length is fixed throughout the annealing. In SAH, the chain length is dynamically adjusted. The idea is to allow "the establishment of a steady-state probability distribution of the accessible states" [HRS86]. We omit the details of how this adjustment is done.

• What is the freezing condition? The freezing condition determines when the annealing process stops. The intuition is that the temperature is small enough that the probability of accepting upward moves is negligible, and very few downward moves are discovered. In SAJ, the freezing condition consists of two tests. The first test is whether there has been any improvement over the best solution value during the last 5 temperatures. If this test fails, then a check is made to see whether the percentage of accepted moves at the current temperature exceeds a parameter denoted by minPercent. If neither test is satisfied, the system is said to have "frozen".

In SAH, the difference between the maximum and minimum costs among the accepted states at the current temperature is compared with the maximum change in cost in any accepted move during the current temperature. If they are the same, the system is declared frozen since "... apparently all the states accessed are of comparable costs, and there is no need to use simulated annealing [any further]" [HRS86].
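As promised above, here is a small sketch of the two temperature schedules. The typical constants (initProb = 0.4, tempFactor = 0.95, λ = 0.7, K = 20) are those quoted in the text; the toy acceptedFraction() model and the numeric starting values are assumptions used only to make the sketch runnable.

#include <math.h>
#include <stdio.h>

/* Assumed stand-in: acceptance probability of a representative upward
 * move of cost delta at temperature T (the real test samples moves). */
static double acceptedFraction(double T)
{
    const double delta = 100.0;
    return exp(-delta / T);
}

int main(void)
{
    const double initProb = 0.4, tempFactor = 0.95;
    const double lambda = 0.7, sigma = 50.0;  /* sigma: sampled cost std. dev. */

    /* SAJ initial temperature: start at the initial state's cost and
     * keep doubling until the accepted fraction reaches initProb. */
    double T = 60.0;                          /* stand-in for cost(initial state) */
    while (acceptedFraction(T) < initProb)
        T *= 2.0;
    printf("SAJ T0 = %.1f\n", T);

    double Tj = T, Th = 20.0 * sigma;         /* SAH: T0 = K * sigma, K = 20 */
    for (int stage = 0; stage < 5; stage++) {
        Tj *= tempFactor;                            /* SAJ: fixed geometric cooling */
        Th *= fmax(0.5, exp(-lambda * Th / sigma));  /* SAH: factor bounded below by 0.5 */
        printf("stage %d: SAJ T = %.1f, SAH T = %.1f\n", stage, Tj, Th);
    }
    return 0;
}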

4 Application to Query Optimization

We now discuss how we adapted the general algorithms described in Section 3 to the optimization of large join queries. As explained in Section 2, the algorithms need to determine a good outer linear join processing tree (OLJT). This is equivalent to determining an optimal (or, a good) ordering, i.e., permutation, of the relations. The search is restricted to the space of valid OLJTs (or, equivalently, valid permutations). In this section, the permutation notation will be used to represent the corresponding join tree. A state, then, is a valid permutation. The join processing cost model is used to estimate the cost of a state.

4.1 Move Set

We now discuss the different kinds of moves. Let S be the current state:

S = ( ... i ... j ... k ... )

The two kinds of moves that we use are Swap and 3Cycle.

• Swap: Select two distinct relations, say, i and j, at random. Check if interchanging i and j results in a valid permutation. If so, the move consists of swapping i and j to get the new state newS:

newS = ( ... j ... i ... k ... )

• 3Cycle: Select three distinct relations, say, i, j and k, at random. The move consists of cycling i, j and k: i is moved to the position occupied by j, j is moved to the position occupied by k, and k is moved to the position occupied by i. We first check if the resulting permutation is valid, and if that is the case, we get the new state newS:

newS = ( ... k ... i ... j ... )

Our move set is then (Swap, 3Cycle, α), α ∈ [0, 1], where α is the frequency with which Swap is selected (clearly, 1 - α is the frequency of 3Cycle). The parameter α needs to be adjusted for each technique.

In most work on combinatorial optimization problems, a move is a small perturbation in the state to get another state. This is satisfied by our choice of moves. It is also desirable that the cost of the new state be incrementally computable from the cost of the current state, in order that testing the move is not expensive. In the moves that we use, we usually do not have to traverse the entire join tree to compute the new cost. These two kinds of moves have been used in tackling other combinatorial optimization problems, and more complex moves can be regarded as composed of sequences of these moves.
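A hedged sketch of the two moves on a join-order permutation array follows. The isValid() check is a placeholder (an assumption, not from the paper) for the test that the new permutation still needs no cross-product; here an invalid move is simply undone, which is equivalent to rejecting it.

#include <stdio.h>
#include <stdlib.h>

#define NREL 11   /* N = 10 joins -> N + 1 relations */

/* Placeholder: the real test checks that every relation in the order
 * joins with some earlier relation (no cross-product needed). */
static int isValid(const int *p) { (void)p; return 1; }

/* Swap: interchange two distinct randomly chosen positions; undo if
 * the result is not a valid permutation. */
static void swapMove(int *p)
{
    int i = rand() % NREL, j = rand() % NREL;
    while (j == i) j = rand() % NREL;
    int t = p[i]; p[i] = p[j]; p[j] = t;
    if (!isValid(p)) { t = p[i]; p[i] = p[j]; p[j] = t; }
}

/* 3Cycle: cycle three distinct positions (i's relation to j's slot,
 * j's to k's slot, k's to i's slot); undo the cycle if invalid. */
static void threeCycleMove(int *p)
{
    int i = rand() % NREL, j = rand() % NREL, k = rand() % NREL;
    while (j == i) j = rand() % NREL;
    while (k == i || k == j) k = rand() % NREL;
    int t = p[k]; p[k] = p[j]; p[j] = p[i]; p[i] = t;
    if (!isValid(p)) { t = p[i]; p[i] = p[j]; p[j] = p[k]; p[k] = t; }
}

int main(void)
{
    const double alpha = 0.5;   /* frequency of Swap in the move set */
    int p[NREL];
    for (int i = 0; i < NREL; i++) p[i] = i;
    for (int m = 0; m < 5; m++) {
        if (rand() / (double)RAND_MAX < alpha) swapMove(p);
        else threeCycleMove(p);
    }
    for (int i = 0; i < NREL; i++) printf("%d ", p[i]);
    printf("\n");
    return 0;
}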

4.2 Stopping Criteria

An important question in the design of the algorithms for query optimization is that of stopping criteria. One simple criterion is obtained by specifying a maximum time limit, and terminating the optimization process if it exceeds the time allowed. This criterion is useful independent of the other stopping criteria being used. Hence, in addition to the other criteria, we use the criterion of a maximum time limit in all our experiments. Indeed, the maximum time allowed is a parameter, just like other parameters of an algorithm.

One could stop earlier if one reaches a global minimum. The problem is that of identifying a global minimum. One must remember that the size of the problems is such that exhaustive search to obtain the global minimum for the benchmark queries is impractical. The best we can do is to estimate a lower bound on the cost of a global minimum, and use this in the stopping criterion. One natural way to use the lower bound is to stop when the optimizer obtains a solution whose cost is sufficiently close to the lower bound. In our experiments, we define "sufficiently close" to be within twice the estimated lower bound.

To use this stopping criterion, we need to estimate a lower bound. In combinatorial optimization problems like the Traveling Salesman Problem (TSP), the theory has been sufficiently developed to permit very good estimates of the lower bound. This is not the case in query optimization. We currently use comparatively crude estimates of the lower bound. Essentially, we sum the costs of all the processing which is independent of the size of the intermediate results, e.g., the costs of scanning the relations or the costs of creating the hash tables.

Since we do not know how good the lower bound is, we cannot always assess, in an absolute sense, the quality of the solution produced by an algorithm. Hence, one cannot make very definitive statements about an algorithm in isolation. However, comparisons between algorithms are not affected. Also, the use of a bad estimate of the lower bound in the stopping criterion only means that the algorithm may run longer than needed; on the average, the quality of the solution produced is not affected.
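Putting the two criteria together, the stopping test is a simple disjunction; a minimal sketch, where the clock-based limit and the factor of two mirror the criteria described above:

#include <stdio.h>
#include <time.h>

/* Stop when the time limit is exceeded, or when the best cost found is
 * within twice the estimated lower bound. */
static int shouldStop(clock_t start, double maxSeconds,
                      double bestCost, double lowerBound)
{
    double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
    return elapsed > maxSeconds || bestCost <= 2.0 * lowerBound;
}

int main(void)
{
    clock_t start = clock();
    /* best cost 250 vs lower bound 100: 250 > 200, so keep optimizing */
    printf("stop? %d\n", shouldStop(start, 30.0 * 60.0, 250.0, 100.0));
    return 0;
}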

4.3 Is the Algorithm Good?

It remains to decide on a measure of the goodness of an algorithm. One often needs such a single measure, e.g., it is needed for the factorial experiments discussed in Section 5. From a practical point of view, we regard an algorithm as being good if it performs well on the average, and very rarely performs poorly. We wish to translate this somewhat vague specification into a quantitative measure. In order to compare the performance of the algorithms on different queries, we need to scale the cost of the solutions obtained. The scaling is performed by dividing by the cost of the best solution obtained. Thus, the solution quality is a dimensionless number greater than or equal to 1.

It would be easy enough to simply compare the means of the scaled solution costs. However, this would not necessarily reflect the criterion stated in the above paragraph, as the following (hypothetical) example shows.

Example: Two algorithms A1 and A2 are being compared. The same queries were optimized using both A1 and A2, and the following results were obtained (we show the scaled solution costs along with the percentage of the queries having these costs):

A1: 60% - 1, 30% - 2, 10% - 10
A2: 40% - 1, 40% - 2, 16% - 5, 4% - 10

If one computes the simple means,

mean(A1) = 2.20, mean(A2) = 2.40

then it would appear that A1 was better than A2. However, according to the criterion we have specified, we would actually prefer A2. □

To resolve this problem, we define the notion of an outlying value. An outlying value is a final solution cost (obtained by some algorithm) which is much higher than the best solution cost. Intuitively, an algorithm performs poorly on a particular query if the solution cost obtained by the algorithm is an outlying value. In our experiments we define a solution cost to be an outlying value if it is at least 10 times the best solution cost. Note that there is no problem in using the best solution cost as a standard because these are comparative experiments, and the analysis is performed after all the algorithms have optimized the queries.

A better measure than the simple mean is obtained as follows. First we compute the mean excluding the outlying values (denote this "trimmed" mean by a). We compute the count of the outlying values (denote this by b). Our measure of goodness of the algorithm is then a + b. The reason we just count the outlying values is that once a solution is considered poor, we are not much interested (from a practical point of view) in how poor it is. For the same reason, we do not include the outlying values in the computation of the mean; they would skew the mean too much, since the mean is not a very robust statistic. The trimmed mean a tries to capture the performance of the algorithm "on the average", and b tries to quantify how often the algorithm "performs poorly".

We have simplified things somewhat; we actually add the square root of b to a. An explanation of this variance stabilizing transformation (and other such transformations) can be found in [BHH78]. Hence, our measure is a + √b. Using this new measure in our example, we find

a1 = 1.33, b1 = √10 = 3.16
a2 = 2.08, b2 = √4 = 2
a1 + b1 = 4.49, a2 + b2 = 4.08

Now a1 + b1 > a2 + b2, and our measure agrees with our intuition that A2 is better than A1.
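As an arithmetic check, the sketch below computes the a + √b measure for a ten-query sample that mirrors the A1 distribution (the threshold of 10 marks outlying values, as above). Note that b is the absolute outlier count, so this small sample yields √1 where the hundred-query example above yields √10.

#include <math.h>
#include <stdio.h>

/* Goodness measure: trimmed mean of the non-outlying scaled costs (a)
 * plus the square root of the outlier count (b); outlier = cost >= 10. */
static double goodness(const double *scaled, int n)
{
    double sum = 0.0;
    int kept = 0, outliers = 0;
    for (int i = 0; i < n; i++) {
        if (scaled[i] >= 10.0)
            outliers++;               /* counted, but excluded from the mean */
        else {
            sum += scaled[i];
            kept++;
        }
    }
    double a = (kept > 0) ? sum / kept : 0.0;
    return a + sqrt((double)outliers);
}

int main(void)
{
    /* ten queries distributed like A1: 60% cost 1, 30% cost 2, 10% cost 10 */
    double a1[10] = { 1, 1, 1, 1, 1, 1, 2, 2, 2, 10 };
    printf("measure = %.2f\n", goodness(a1, 10));   /* 1.33 + sqrt(1) = 2.33 */
    return 0;
}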

4.4 Query Generation

To compare the various algorithms in a comprehensive manner using the statistical methods discussed in Section 5, we need to generate a large number of queries. This is done as follows. Distributions for various parameters of the queries are specified. Values for these parameters are then obtained using random numbers distributed accordingly. Different queries are generated by specifying different initial seeds. This enables us to obtain a large number of "random" queries.

N, the number of joins, is allowed to take values 10 through 100 (the number of joining relations is N + 1). The join graph is generated as follows. In the initial permutation (1 2 3 ... N N+1), a connected join graph is obtained by using N joins to make the permutation a valid permutation. Observing this constraint, the relation to join with a given relation is chosen at random. Then, up to N additional joins are assigned (again at random). The joining attributes are selected at random.
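One plausible reading of this construction is sketched below: joining each relation to a randomly chosen earlier relation in the initial permutation keeps the join graph connected and the permutation valid, after which up to N additional edges are added at random. This is an assumption about the generation scheme, and attribute selection and the parameter distributions given below are omitted.

#include <stdio.h>
#include <stdlib.h>

#define NREL 11   /* N = 10 joins -> N + 1 relations */

int main(void)
{
    int edge[NREL][NREL] = { { 0 } };

    /* N joins making the initial permutation (R0 R1 ... RN) valid:
     * every relation joins some relation earlier in the permutation. */
    for (int r = 1; r < NREL; r++) {
        int earlier = rand() % r;
        edge[r][earlier] = edge[earlier][r] = 1;
    }

    /* Up to N additional joins, assigned at random (duplicates and
     * self-loops are simply skipped). */
    int extra = rand() % NREL;
    for (int e = 0; e < extra; e++) {
        int a = rand() % NREL, b = rand() % NREL;
        if (a != b)
            edge[a][b] = edge[b][a] = 1;
    }

    for (int a = 0; a < NREL; a++)
        for (int b = a + 1; b < NREL; b++)
            if (edge[a][b])
                printf("join: R%d -- R%d\n", a, b);
    return 0;
}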

The relation cardinality is the number of tuples in a relation. Each relation can have selection predicates which restrict the tuples of the relation which participate in joins with other relations. The number of distinct values in a join column is an important factor in determining the size of intermediate results. Finally, indexes can be used as efficient access paths.

The features which characterize individual relations are distributed as follows:

• Relation Cardinalities: [10, 100) - 20%, [100, 1000) - 64%, [1000, 10000) - 16%

• Selections: The number of selection predicates per relation ranges from 0 to 2. The selectivities of the selection predicates were chosen randomly from the following list: 0.001, 0.01, 0.1, 0.2, 0.34, 0.34, 0.34, 0.34, 0.34, 0.5, 0.5, 0.5, 0.67, 0.8, 1.0

• Distinct Values in join columns (as a fraction of the relation cardinality): (0, 0.2] - 75%, (0.2, 1) - 5%, 1.0 - 20%

• Indexes: About 25% of the columns involved in join predicates were selected to have indexes (the actual columns were chosen at random).

We model a distribution of the relation cardinalities where we have a small but not insignificant number of both small and large relations, and a majority of the relations are of medium size. The available selectivities were chosen so that the selectivities of 1/3 and 1/2 (estimates used by many query optimizers like System R) were the most common, and there are a few small and large selectivities. In deciding on the distribution of distinct values, we took into account that columns with unique values (e.g., key columns) are often present. The remaining columns are assumed to have a much smaller number of distinct values. This works out to about 10% of the relation cardinality on the average. Again, this is an estimate often used. We have not seen any studies on the average number of indexes, and chose the figure above as a reasonable one.

5 Experimental Comparison of Algorithms

The different algorithms were coded in C, and the experiments were run on very lightly loaded HP 9000/350 workstations. The workstation is based on a 25 MHz 68020 processor, and is approximately a 4 MIPS machine. Note that the optimizer programs are completely CPU bound; memory requirements are negligible compared to the main memory of the workstations.

For the experiments on tuning the parameters of the algorithms, we used 50 different queries for each of N = 10, 20, 30, 40, 50, giving a total of 250 different queries. Each algorithm was run twice on each query (using different initial seeds), thus giving two replicates per query, which were then averaged. For the experiments on comparing the algorithms, we used 50 different queries for each of N = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, giving a total of 500 different queries. We obtained four replicates for each algorithm per query. The entire set of experimental runs took about six weeks with the programs running most of the time.

5.1 Tuning of Parameters

Before comparing the different algorithms, we need to ensure that the parameters of the algorithms are set to "good" values. An example of a parameter (which happens to be common to all the algorithms except QS) is α, the frequency with which Swap is selected. Parameters particular to the algorithms are:

II, SH: seqLength
SAJ: tempFactor, initProb, sizeFactor, minPercent
SAH: K, size, minAccept, M (see [HRS86])

To obtain reliable values for these parameters, we use the methodology of factorial experiments [Dav78], [BHH78]. To be more precise, we use 2^n factorial and fractional factorial experiments, where n is the number of parameters. Usually, each parameter (or factor) can take on a number of values, and it is very expensive or time consuming to try all possible combinations of all possible values of the factors. The idea in 2^n factorial experiments is to consider, in one experiment, two levels (called low and high) of each factor, and try out all combinations of these levels.

Statistical theory enables us to estimate the main effects of the parameters and their interactions. We then vary the values of the parameters in the direction of increasing benefit. If the parameter has no significant effect, we reduce the range, i.e., we decrease the separation between the low and the high levels. In fractional factorial experiments, we do not try all combinations of the levels, but only a fraction of these combinations, in such a way that we can still estimate the effects of interest to us. Details of this methodology and its advantages over the more common "one parameter at a time" method are explained in the references cited earlier. An example illustrating the use of factorial experiments is given in [Swa87b].

5.1.1 Time is a Factor

In addition to the parameters mentioned above, we also included time as a factor. By time we mean the maximum time allowed for one optimization run; this is used in the stopping criterion discussed in Section 4. This enables us to determine if a particular algorithm shows no significant change in the quality of the solutions produced beyond a certain time limit (we say that the algorithm "saturates" at this time limit). The performance of the algorithm at the saturation time limit approximates its performance if it were potentially given unlimited time.

Introducing time as a factor allows us to discover and study interactions between time and other factors (if such interactions exist). The time limit for a query is proportional to N^2, with the constant factor changing for different time limits. When we mention time limits we will specify them for a single value of N; the corresponding time limits for other values of N can be easily deduced. For example, if the time limit at N = 50 is 7.5 minutes, the corresponding time limit at N = 100 is (100/50)^2 * 7.5 = 30 minutes.

5.1.2 Results

The results of tuning the parameters using factorial experiments are given below (all saturation times are in minutes and are given for N = 50):

II: seqLength = N, time = 7.5
SH: seqLength = N, time = 7.5
SAJ: sizeFactor = 1, tempFactor = 0.975, minPercent = 2, initProb = 0.4, time = 7.5
SAH: K = 20, size = 2N, minAccept = 2N, M = N^2/2, time = 7.5
PW: time = 5
QS: time = 7.5

The parameter α takes the value of 0.5 for all the algorithms. Actually, no particular value of α proved better than any other value. We arrived at the value of 0.5 by our method of decreasing the range whenever a parameter had no significant effect.

5.2 Comparison of Algorithms

The maximum saturation time is 7.5 minutes at N = 50. This time is 30 minutes at N = 100. We could compare the algorithms only at this saturation time. However, it is possible that the ordering among the algorithms may change with different time limits. Hence we decided to compare the algorithms for a range of time limits up to the time limit of 30 minutes. The time limits we experimented with are 1 minute, 2 minutes, 10 minutes, and 30 minutes (all these times are specified at N = 100).

To compare the algorithms we used the techniques of Analysis of Variance (ANOVA) [DW83]. The idea is to run the algorithms on a large number of different queries. Now, we wish to see if there is any significant difference among these algorithms. ANOVA checks this by comparing the spread among the algorithms with the spread within an algorithm. If these are comparable, then no significant difference can be detected. If not, the algorithms differ in their performance.

The spreads are compared using the following standard F-test on the ratio of two mean sums of squares (MSS). Let MSS_a measure the spread among the algorithms, and MSS_w measure the spread within the queries optimized by an algorithm. Then, the probability of the ratio MSS_a/MSS_w (this will be denoted by F_ab below) attaining or exceeding this value according to the F distribution is obtained (this probability is expressed as a percentage below). The lower the probability, the more significant is the difference among the algorithms. As in the factorial experiments, percentages higher than 10% indicate statistically insignificant differences.
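For concreteness, here is a toy one-way ANOVA sketch that computes the ratio MSS_a/MSS_w for two hypothetical algorithms with four scaled solution costs each. The data are invented; the resulting F value would be referred to the F distribution with (g - 1) and g(n - 1) degrees of freedom.

#include <stdio.h>

#define G  2    /* number of algorithms (groups) */
#define NQ 4    /* scaled solution costs per algorithm (made-up data) */

int main(void)
{
    double x[G][NQ] = {
        { 1.0, 1.2, 1.1, 1.3 },   /* hypothetical algorithm 1 */
        { 2.0, 2.2, 1.9, 2.1 },   /* hypothetical algorithm 2 */
    };
    double mean[G], grand = 0.0;

    for (int a = 0; a < G; a++) {
        mean[a] = 0.0;
        for (int q = 0; q < NQ; q++) mean[a] += x[a][q];
        mean[a] /= NQ;
        grand += mean[a];
    }
    grand /= G;                   /* grand mean (equal group sizes) */

    double ssa = 0.0, ssw = 0.0;
    for (int a = 0; a < G; a++) {
        ssa += NQ * (mean[a] - grand) * (mean[a] - grand);
        for (int q = 0; q < NQ; q++)
            ssw += (x[a][q] - mean[a]) * (x[a][q] - mean[a]);
    }

    double mssa = ssa / (G - 1);          /* spread among the algorithms */
    double mssw = ssw / (G * (NQ - 1));   /* spread within an algorithm */
    printf("F = %.2f\n", mssa / mssw);    /* large F -> significant difference */
    return 0;
}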

If the test indicates that there exists a significant difference among the algorithms, we would like to group the algorithms according to how well they perform, i.e., algorithms which do not differ among themselves would be in the same group. The way we do this is to examine the mean performance of the algorithms to see if we can identify any groups. Note that we are using means of the scaled solution values, i.e., the solution values are divided by the best solution values. Then we verify that our grouping is correct by checking that:

• the algorithms in a group do not differ among themselves

• when an algorithm from one group is included in another group, there is a significant difference among the algorithms in the new group

We use the same F-test for identifying significant differences (only now we apply the test to subsets of the set of all algorithms). We can then order the groups (and the algorithms within a group) according to their means. An example illustrating this methodology is given in [Swa87b]. Another good example is to be found in [NMF87].

5.2.1 Ordering Among Algorithms

In Table 1, we present the results of our comparative experiments. At each time, we identify the algorithm groups by enclosing them in brackets. The algorithms which perform better come earlier in the sequence. We also give the means of the scaled solution costs; the costs have been scaled by dividing by the best solution costs obtained at 30 minutes. Care should be taken in interpreting these means, as was noted in Section 4.3.

Time          Groups and Means

1 minute      [II] [SAH, SH, QS, SAJ] [PW]
              [1.62] [2.74, 2.92, 2.95, 3.09] [3.71]

2 minutes     [II] [SAH, QS, SH, SAJ] [PW]
              [1.62] [2.69, 2.74, 2.83, 2.88] [3.38]

10 minutes    [II] [SAJ, QS] [SAH, SH, PW]
              [1.34] [2.27, 2.35] [2.47, 2.56, 2.69]

30 minutes    [II] [SAJ] [QS] [PW, SH, SAH]
              [1.17] [1.82] [2.15] [2.34, 2.39, 2.45]

Table 1: Comparison of Algorithms

[Figure 3: Sensitivity to Time. Average scaled costs of II, SAH, SAJ, SH, PW, and QS plotted against time in minutes.]

Clearly, iterative improvement (II) is superior to all the other algorithms over all time limits. Using more complex algorithms like simulated annealing does not result in better performance. In fact, the performance of II at 1 minute (mean = 1.62) is better than the performance of SAJ (mean = 1.82) at 30 minutes. Note that SAJ is the second best algorithm at 30 minutes. The sequence heuristic (SH) and perturbation walk (PW) are clearly inferior at all the time limits.

One possible explanation of these results is that the solution space has a large number of local minima, with a small but significant fraction of them being deep local minima. II can traverse large regions of the search space and thus stands a good chance of finding one of the deep minima. QS too traverses large portions of the search space, but II does better because it always ends up in a local minimum. PW and SH do badly because they do not travel far from the starting state. The simulated annealing algorithms do not travel as much as II because they never make totally random moves once the initial state is chosen. However, they travel more than PW and SH as they can accept cost increasing moves. We are working towards characterizing the search space better. We hope then to come up with a better explanation of these results.

[Figure 4: Performance of II with Time. Average scaled costs of II for N = 10, 40, 70, 100 plotted against time in minutes.]

However, as the time given increases, simulated annealing can improve its performance, since it can travel further. Thus, the simulated annealing algorithm described in [JAMS87] (SAJ) moves from being the second worst algorithm at 1 minute to being the second best algorithm at 10 and 30 minutes. It is not clear why the simulated annealing algorithm described in [HRS86] (SAH) slips in the ranking.

We also found from the data that at 30 minutes less than 10% of the best solutions were outlying values, i.e., greater than ten times the lower bound. Also, the means of the best solutions which were not outlying values were close to twice the lower bounds. Note that we halt the algorithms once they achieve a cost which is less than twice the lower bound. All this means that at the saturation time, algorithms like II perform reasonably well in solving the large join query optimization problem (LJQOP).

5.2.2 Effect of Time

In Table 1, we find that as the time limit increases, the algorithms other than iterative improvement split into smaller groups. This means that the performances of some algorithms are more sensitive to time than others. To illustrate this, we use the data in Table 1 to draw Figure 3. We can see in the figure how the groups split up as some algorithms improve more rapidly with time than others. All the algorithms improve significantly going from 1 minute to 10 minutes; except for SAJ, the improvement in going from 10 minutes to 30 minutes is not so marked.

We investigate II further since it proves to be clearly superior. In Figure 4, we show how the average scaled costs obtained by iterative improvement (II) change with time for N = 10, 40, 70, 100. We graph the costs at 1 minute, 2 minutes, 10 minutes, and 30 minutes. In all cases, the costs have been scaled using the best solutions obtained at the time limit of 30 minutes. We now see that the large improvement in performance when time is increased from 1 minute to 10 minutes holds mainly for larger N; for N = 10, there is very little improvement. We observe that for all N there is not much improvement when time is increased from 10 minutes to the saturation time. This is an interesting fact which may be used when optimizing queries which are expected to be executed only a few times. In that case, one may obtain sufficiently good solutions at times much less than the saturation time.

6 Discussion

The large join query optimization problem (LJQOP) is a hard combinatorial optimization problem. Such problems have been tackled using general techniques such as those based on local search. Another approach is to use heuristics, such as "divide and conquer", which can drastically reduce the search space. In this paper we have investigated the first approach to LJQOP. We have described the algorithms of perturbation walk, quasi-random sampling, iterative improvement, sequence heuristic, and simulated annealing. We showed how these techniques can be adapted to LJQOP.

We described how these algorithms can be tuned using factorial experiments. By using time as a factor, we were able to reliably determine the saturation time, that is, the limit beyond which increasing time shows no significant improvement in the performance of the algorithm. Also, the fact that the parameter α, which gives the fraction of Swap moves to 3Cycle moves, has no effect leads us to hypothesize that our choice of simple moves is adequate.

We then described how we used analysis of variance (ANOVA) to compare the different algorithms. Iterative improvement is superior to all the other algorithms at all the time limits. Also, it seems that simulated annealing by itself is not a useful technique for LJQOP. We suggested that one possible explanation of our results of the comparison of the algorithms is that the search space has a large number of local minima, with a small but significant fraction of them being deep local minima, and II does well because it can traverse large regions of the search space.

We also showed that at the saturation time over 90% of the best solutions were quite close to our target of twice the lower bound. The good quality of the solutions obtained indicates that we can do reasonably well in solving LJQOP using just these general algorithms. We also showed that the performances of all algorithms except SAJ improve rapidly up to a certain time limit, beyond which much smaller improvements are obtained. This time limit is less than the saturation time. This can be used to help decide how much time to spend on optimizing a query. If a query will be used only a few times, one can obtain sufficiently good solutions using a time limit smaller than the saturation time.

Our work can be extended in various ways. In future, we intend to include join methods other than the hash join method. We also plan to incorporate other cost models, e.g., a cost model for disk-based query processing. In addition, we will look at large join queries generated by strategies which try to introduce "clusters"; these may model a number of useful kinds of queries.

We intend to investigate the other important approach to combinatorial optimization problems, viz., the use of heuristics. It would be interesting to compare these heuristics to the algorithms we have investigated. Finally, a combination of general algorithms and heuristics may prove to be useful; this also needs to be investigated.

Acknowledgements

The first author acknowledges Prof. Gio Wiederhold, Peter Lyngbaek and Marie-Anne Neimat for their encouragement and support, and for helping by way of discussions and comments on earlier drafts. He also thanks Tim Read for advice regarding statistical theory and practice. Arun Swami is supported by Hewlett-Packard Laboratories under the contract titled "Research in Relational Database Management Systems" and, earlier, was supported by DARPA contract N00039-84-C-0211 for Knowledge Based Management Systems. Anoop Gupta is supported by a faculty award from Digital Equipment Corporation.

References

[BHH78] G. E. P. Box, W. G. Hunter, and J. S. Hunter. Statistics for Experimenters. John Wiley and Sons, 1978.

[Dav78] O. L. Davies, editor. The Design and Analysis of Industrial Experiments. Longman Group Limited, 2nd edition, 1978.

[DKO*84] D. J. DeWitt, R. H. Katz, F. Olken, L. D. Shapiro, M. R. Stonebraker, and D. Wood. Implementation Techniques for Main Memory Database Systems. In Proceedings of ACM-SIGMOD International Conference on Management of Data, pages 1-8, June 1984.

[DW83] S. Dowdy and S. Wearden. Statistics for Research. John Wiley and Sons, 1983.

[FBC*87] D. H. Fishman, D. Beech, H. P. Cate, E. C. Chow, T. Connors, J. W. Davis, N. Derrett, C. G. Hoch, W. Kent, P. Lyngbaek, B. Mahbod, M. A. Neimat, T. A. Ryan, and M. C. Shan. Iris: An Object-Oriented DBMS. ACM Transactions on Office Information Systems, 5(1):48-69, January 1987.

[HRS86] M. D. Huang, F. Romeo, and A. Sangiovanni-Vincentelli. An Efficient General Cooling Schedule for Simulated Annealing. In Proceedings of the 1986 ICCAD Conference, pages 381-384, 1986.

[IK84] T. Ibaraki and T. Kameda. Optimal Nesting for Computing N-relational Joins. ACM Transactions on Database Systems, 9(3):482-502, October 1984.

[IW87] Y. E. Ioannidis and E. Wong. Query Optimization by Simulated Annealing. In Proceedings of ACM-SIGMOD International Conference on Management of Data, pages 9-22, 1987.

[JAMS87] D. S. Johnson, C. R. Aragon, L. A. McGeoch, and C. Schevon. Optimization by Simulated Annealing: An Experimental Evaluation (Part I). Draft, June 1987.

[JK84] M. Jarke and J. Koch. Query Optimization in Database Systems. ACM Computing Surveys, 16(2):111-152, June 1984.

[KBZ86] R. Krishnamurthy, H. Boral, and C. Zaniolo. Optimization of Nonrecursive Queries. In Proceedings of the Twelfth International Conference on Very Large Data Bases, pages 128-137, Kyoto, Japan, 1986.

[KGV83] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by Simulated Annealing. Science, 220(4598):671-680, May 1983.

[NMF87] R. E. Nance, R. L. Moose, and R. V. Foutz. A Statistical Technique for Comparing Heuristics: An Example from Capacity Assignment Strategies in Computer Network Design. Communications of the ACM, 30(5):430-442, May 1987.

[NSS86] S. Nahar, S. Sahni, and E. Shragowitz. Simulated Annealing and Combinatorial Optimization. In Proceedings of the 23rd Design Automation Conference, pages 293-299, 1986.

[PS82] C. H. Papadimitriou and K. Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Prentice-Hall, 1982.

[SAC*79] P. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access Path Selection in a Relational Database Management System. In Proceedings of ACM-SIGMOD International Conference on Management of Data, 1979.

[Swa87a] A. Swami. A Cost Model for Memory Resident Databases. Computer Science Department, Stanford University, April 1987.

[Swa87b] A. Swami. Optimization of Large Join Queries. Technical Report STL-87-15, Software Technology Laboratory, Hewlett-Packard Laboratories, November 1987.
