# Efficient optimization of large join queries using Tabu Search


NORTH-HOLLAND


MACIEJ MATYSIAK Institute of Computing Science, Poznań University of Technology, ul. Piotrowo 3a, 60-965 Poznań, Poland

ABSTRACT

Finding an optimal query execution plan for join queries is a hard combinatorial optimization problem. In order to cope with complex large join queries, combinatorial optimization algorithms, such as Simulated Annealing and Iterative Improvement, were proposed as alternatives to traditional enumerative algorithms. In this paper, the relatively new combinatorial optimization technique called Tabu Search is applied. Considering various query sizes, query graph shapes, and optimization times, it is shown that Tabu Search almost always obtains better query execution plans than other combinatorial optimization techniques.

1. INTRODUCTION

A query optimizer in a relational database management system translates a nonprocedural query into a procedural plan for execution by generating many alternative query execution plans (QEPs), estimating the execution cost of each, and choosing the plan having the lowest estimated cost. The computational complexity of the optimization process is determined by the number of alternative QEPs that must be evaluated by the optimizer. In general, the number of alternative plans grows exponentially with the number of relations involved in the query. Traditional query optimizers deal with queries involving only a small number of relations. Therefore, they can use enumerative optimization strategies which consider most of the alternative QEPs [12, 13, 15].
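To give a feeling for this growth, the following minimal sketch counts join orders under two simplifying assumptions that are not taken from the paper: join methods are ignored, and Cartesian products are allowed. Under these assumptions there are n! linear (left-deep) orders of n relations and (2(n-1))!/(n-1)! ordered bushy trees. The function names are illustrative only.

```python
from math import factorial


def left_deep_orders(n_relations: int) -> int:
    """Number of linear (left-deep) join orders of n relations: n!."""
    return factorial(n_relations)


def bushy_trees(n_relations: int) -> int:
    """Number of ordered bushy join trees of n relations: (2(n-1))! / (n-1)!."""
    return factorial(2 * (n_relations - 1)) // factorial(n_relations - 1)


# A query with 10 joins involves 11 relations.
for n in (3, 5, 11):
    print(n, left_deep_orders(n), bushy_trees(n))
```

Even at 11 relations the linear space alone holds tens of millions of orders, before join methods and argument assignments are taken into account.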

The complexity of the query optimization problem increases when we consider nontraditional applications, such as decision support systems, expert systems, knowledge base systems, object-oriented database systems, and applications from logic programming. These applications tend to pose much more complex queries referring to more relations and requiring processing more joins than traditional applications [2, 10, 13]. For these applications, the enumerative optimization strategies are inadequate, even with the above-mentioned heuristics, because they face a combinatorial explosion of alternative QEPs to generate and evaluate.

INFORMATION SCIENCES 83, 77-88 (1995) © Elsevier Science Inc., 1995, 655 Avenue of the Americas, New York, NY 10010

0020-0255/95/$9.50 SSDI 0020-0255(94)00094-R

Combinatorial strategies for query optimization were first described in [8]. In [16, 17], Swami and Gupta investigated the problem of using Simulated Annealing (SA) and Iterative Improvement (II) for optimizing nonrecursive large join queries. Later, Ioannidis and Kang [6, 7] proposed a new hybrid algorithm (called 2PO) that combines II and SA. Recently, Swami and Iyer [18] proposed a new polynomial time algorithm that combines combinatorial searching with the enumerative neighborhood search algorithm proposed earlier by Ibaraki and Kameda [5].

In this paper, the relatively new combinatorial optimization technique called Tabu Search (TS) [3, 4] is applied to the optimization of large join queries. The major contribution of the paper is the comparison of the performance of Tabu Search with the performance of Simulated Annealing and Iterative Improvement. The performance of these techniques is analyzed with respect to different query types (linear, star, and bushy), different query execution plan types (outer linear join processing trees, bushy join processing trees), and different sizes of queries. It is shown that TS almost always produces a better QEP.

The paper is organized as follows. In Section 2, the problem of large join query optimization is formulated as a combinatorial optimization problem. Section 3 describes the Iterative Improvement and Simulated Annealing algorithms. Tabu Search is discussed in Section 4. The application of combinatorial techniques to query optimization is presented in Section 5. Sections 6 and 7 present the parameters and results of the experiments.

2. PROBLEM FORMULATION

The execution cost of a relational query depends on the order in which the relational operators involved in the query are performed. This order represents a query execution plan (QEP). For a given query, there might be many alternative QEPs. The query optimization problem is to find a QEP having a cost as low as possible.

It has been shown [1, 9, 15] that using the heuristics of performing selections and projections as early as possible and excluding unnecessary Cartesian products, we can eliminate from consideration certain suboptimal QEPs. Thus, considering the join operation as a 2-way join, the optimizer must select the best sequence of 2-way joins to achieve the N-way join of relations requested by the query. Each join in this sequence may have its own performing method (nested-loop, sort-merge, hash-based, etc.). Most join methods (e.g., nested-loop) distinguish between the two operands, one being the outer relation and the other being the inner relation.

Thus, the query optimization problem is reduced to finding the order in which relations should be joined, together with the best join method for each 2-way join and the best argument assignment of each join. In this paper, a large join query is a query for which the number of required join operations is equal to or greater than 10.

3. COMBINATORIAL ALGORITHMS

3.1. INTRODUCTION AND TERMINOLOGY

Each solution to a combinatorial optimization problem is considered as a state in a state space. Each state S has a cost associated with it, cost(S), as given by some cost function. The aim of the optimization algorithm is to minimize cost(S) by performing random walks in the state space. A walk consists of a sequence of moves. A move is a transformation applied to a state to get another state. The states that can be reached in one move from a state S are called neighbors of S. A move is called downward (upward) if the cost of the source state is higher (lower) than the cost of the destination state. A state is called a local minimum if its cost is equal to or lower than that of all its neighbors. A state is called a global minimum if it has the lowest cost among all states in the state space. The optimal solution is a global minimum.

3.2. ITERATIVE IMPROVEMENT (II) AND SIMULATED ANNEALING (SA)

II is a very simple metaheuristic procedure, also characterized as a valley descending technique. It works as follows. The starting state is selected randomly. Then, II walks downward, choosing any of the neighboring states that decrease the cost, until it reaches a local minimum. II repeats this local optimization procedure from new starting states until a stopping condition is satisfied, and then the local minimum with the lowest cost found is returned. Further information about II can be found in [6, 16].
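The procedure can be sketched on a toy problem as follows. This is only an illustration under stated assumptions: the state is a permutation, `toy_cost` is a hypothetical stand-in for a QEP cost function, a move swaps two positions, a state is treated as an r-local minimum once 20 random neighbors in a row fail to improve it, and a fixed number of restarts replaces a time limit.

```python
import random


def toy_cost(state):
    """Hypothetical stand-in for a QEP cost function."""
    return sum(i * v for i, v in enumerate(state))


def random_neighbor(state, rng):
    """One move: swap two randomly chosen positions."""
    i, j = rng.sample(range(len(state)), 2)
    s = list(state)
    s[i], s[j] = s[j], s[i]
    return tuple(s)


def descend(state, cost, rng, r=20):
    """Walk downward until r consecutive random neighbors fail to improve."""
    failures = 0
    while failures < r:
        cand = random_neighbor(state, rng)
        if cost(cand) < cost(state):
            state, failures = cand, 0   # downward move accepted
        else:
            failures += 1
    return state


def iterative_improvement(n, cost, restarts=10, seed=0):
    """Repeat local optimization from random starts; keep the best local minimum."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        start = tuple(rng.sample(range(n), n))   # random starting state
        local_min = descend(start, cost, rng)
        if best is None or cost(local_min) < cost(best):
            best = local_min
    return best


best = iterative_improvement(8, toy_cost)
print(best, toy_cost(best))
```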

In contrast to II, Simulated Annealing investigates the state space by performing both downward and upward moves. It always accepts downward moves, but upward moves are accepted with a probability which depends on the increase in cost between a neighbor and the current state, and on a parameter called the temperature (T). The higher the temperature or the smaller the cost difference, the more likely it is that an upward move will be accepted. A single optimization consists of several walks through the state space. A walk ends when some inner-loop criterion is satisfied. Then, the temperature is reduced according to some function and another walk begins. The algorithm stops when the system is considered to be frozen, i.e., when the temperature is equal to zero. The formal description of this algorithm can be found in [6, 8, 16].
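A minimal sketch of this scheme is given below. Several details are assumptions rather than the paper's choices: upward moves are accepted with the usual Metropolis probability exp(-delta/T), the temperature is reduced geometrically by a hypothetical factor `alpha`, and the system counts as frozen once T drops below a small threshold `t_min`. The toy cost and swap move are hypothetical stand-ins for a QEP cost function and move set. The initial temperature is the cost spread of 20 random states.

```python
import math
import random


def toy_cost(state):
    """Hypothetical stand-in for a QEP cost function."""
    return sum(i * v for i, v in enumerate(state))


def random_neighbor(state, rng):
    """One move: swap two randomly chosen positions."""
    i, j = rng.sample(range(len(state)), 2)
    s = list(state)
    s[i], s[j] = s[j], s[i]
    return tuple(s)


def accept(delta, temperature, rng):
    """Accept non-increasing moves always, upward moves with prob. exp(-delta/T)."""
    if delta <= 0:
        return True
    return rng.random() < math.exp(-delta / temperature)


def simulated_annealing(n, cost, walk_len=50, alpha=0.9, t_min=1e-3, seed=0):
    rng = random.Random(seed)
    sample = [tuple(rng.sample(range(n), n)) for _ in range(20)]
    costs = [cost(s) for s in sample]
    temperature = max(costs) - min(costs)   # initial temperature
    state = best = sample[0]
    while temperature > t_min:              # assumed "frozen" test
        for _ in range(walk_len):           # one walk at a fixed temperature
            cand = random_neighbor(state, rng)
            if accept(cost(cand) - cost(state), temperature, rng):
                state = cand
                if cost(state) < cost(best):
                    best = state
        temperature *= alpha                # assumed geometric cooling
    return best


best = simulated_annealing(8, toy_cost)
print(best, toy_cost(best))
```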

4. TABU SEARCH

The main idea of TS is to explore the state space while remembering a path of recently visited states. This path (called the tabu list) constitutes a set of restrictions which are used to prevent the reversal, or sometimes repetition, of moves. It also induces the search to follow a new trajectory if cycling in a narrower sense occurs.

TS works as follows. The starting state is a randomly generated state. To make a move, the iterative procedure generates a sample V* of states from among the set of neighbors N(S) of the current state S. The best state S* in V* is determined and a move from S to S* is made, irrespective of whether it is a downward or an upward move. A number of recently visited states is kept on the tabu list T. It forbids moves which would bring the algorithm back to a previously visited state. Therefore, whenever at a state S a set V* of states in N(S) has to be generated, we check that each candidate for membership of V* is not in T. The iterative procedure stops when some stopping condition is satisfied. This may be, for instance, a lack of improvement of the best solution during kmax iterations. The general Tabu Search algorithm is presented in Figure 1.

In this study, a modified version of TS is used. Instead of randomly generating an initial state, a descent procedure is used to find a local minimum. Instead of storing in T a set of visited states, a set of moves is kept. This reduces the space and time necessary to check the tabu restrictions. In order to compare TS with the other algorithms, a time limit instead of a number of iterations is used to stop the optimization process.
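The following sketch illustrates a move-based tabu list on a toy problem. It is not the paper's implementation: the state is a permutation, `toy_cost` is a hypothetical cost function, a move is a swap of two positions, and a fixed iteration count replaces the time limit. The caller supplies the starting state, which in the modified version above would be a local minimum found by descent.

```python
import random
from collections import deque
from itertools import combinations


def toy_cost(state):
    """Hypothetical stand-in for a QEP cost function."""
    return sum(i * v for i, v in enumerate(state))


def apply_move(state, move):
    """A move is a pair of positions to swap."""
    i, j = move
    s = list(state)
    s[i], s[j] = s[j], s[i]
    return tuple(s)


def tabu_search(start, cost, iterations=200, sample_size=10, tabu_len=20, seed=0):
    rng = random.Random(seed)
    all_moves = list(combinations(range(len(start)), 2))
    state = best = start
    tabu = deque(maxlen=tabu_len)   # the oldest move is dropped automatically
    for _ in range(iterations):
        allowed = [m for m in all_moves if m not in tabu]
        if not allowed:
            break
        # Sample V* from the non-tabu neighborhood and take its best member.
        sample = rng.sample(allowed, min(sample_size, len(allowed)))
        move = min(sample, key=lambda m: cost(apply_move(state, m)))
        state = apply_move(state, move)   # accepted even if it is upward
        tabu.append(move)
        if cost(state) < cost(best):
            best = state
    return best


best = tabu_search(tuple(range(8)), toy_cost)
print(best, toy_cost(best))
```

With 8 elements there are only 28 swap moves, so a tabu list of length 20 leaves at most 8 candidate moves per iteration; on small problems the tabu list can cover most of the neighborhood.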

5. APPLICATION OF TABU SEARCH

5.1. MOVE SET

The QEP is a single state in a state space and is represented by a string of joins, where each join has its own performing method and two distinguished arguments. A move is a single modification applied to a state to get another state. In order to ensure the possibility of reaching every state in a state space, three types of moves are defined: join method exchange, join method argument exchange, and change of join operator order.

    procedure TS()
    {
        /* Get an initial solution */
        S = initialize();
        minS = S;
        /* set a tabu list */
        T = ∅;
        repeat {
            generate V* ⊆ N(S) - T;
            choose best S* ∈ V*;
            S = S*;
            /* update tabu list T */
            T = (T - {oldest}) ∪ {S};
            if cost(S) < cost(minS) then minS = S;
        }
        until (stopping condition);
        return (minS);
    }

Fig. 1. Tabu Search.

The join method exchange is performed as follows. One join operator from the QEP is selected at random. The move consists of changing its join method. The two most popular join methods, nested-loop and sort-merge, are considered. So, the move consists of changing the join method of the join operator from nested-loop to sort-merge, or vice versa.

In order to carry out the join method argument exchange, one join operator from the QEP with the nested-loop method is selected at random. The move consists of exchanging the arguments of the join method. So, the outer relation becomes the inner relation, and vice versa.

The change of join operator order is performed as follows. Select at random one join operator, say J_i, from the QEP, but not the last one. Then, find another join, say J_j, in the QEP, such that the result of J_i is an argument of J_j. The move consists of changing the order of the join operators in such a way that the result of J_j becomes an argument of J_i; it may be either the outer or the inner argument.

Different probabilities are associated with the move types. Since the join order exchange enables us to investigate the state space faster, the probability 0.6 is assigned to this move. The change of the join method and the change of join arguments each have probability 0.2.
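The selection of a move type with these probabilities, together with the two simpler moves, might look as follows. The plan representation (a list of dictionaries with `method`, `outer`, and `inner` keys) and all function names are hypothetical, chosen only for illustration; the join order change is omitted because it depends on the tree representation of the QEP.

```python
import random

MOVE_TYPES = ["join_order_change", "join_method_exchange", "argument_exchange"]
MOVE_WEIGHTS = [0.6, 0.2, 0.2]


def pick_move_type(rng):
    """Draw a move type with probabilities 0.6 / 0.2 / 0.2."""
    return rng.choices(MOVE_TYPES, weights=MOVE_WEIGHTS, k=1)[0]


def join_method_exchange(plan, rng):
    """Flip one randomly chosen join between nested-loop and sort-merge."""
    idx = rng.randrange(len(plan))
    new_plan = [dict(j) for j in plan]   # copy, so the old state is untouched
    j = new_plan[idx]
    j["method"] = "sort-merge" if j["method"] == "nested-loop" else "nested-loop"
    return new_plan


def argument_exchange(plan, rng):
    """Swap outer and inner relation of one randomly chosen nested-loop join."""
    candidates = [i for i, j in enumerate(plan) if j["method"] == "nested-loop"]
    if not candidates:
        return plan                      # move not applicable to this state
    idx = rng.choice(candidates)
    new_plan = [dict(j) for j in plan]
    j = new_plan[idx]
    j["outer"], j["inner"] = j["inner"], j["outer"]
    return new_plan


rng = random.Random(0)
plan = [{"method": "nested-loop", "outer": "R1", "inner": "R2"}]
print(pick_move_type(rng), join_method_exchange(plan, rng), argument_exchange(plan, rng))
```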

5.2. COST FUNCTION

The cost function, corresponding to an execution time of a query, gives a cost for every state, taking into account parameters of relations, join methods, the size of available memory, existence of indexes, etc. There are a few cost formulas which are used, depending on the kind of join method and the existence of indexes. These formulas are based on the following assumptions: (1) there is no pipeline processing, i.e., all base relations are fetched from a disk and all intermediate join results are materialized; (2) sizes of indexes are much smaller than sizes of associated relations; (3) minimal buffering for operations; (4) size of a disk page is 4 KB; and (5) transmission cost from/to a disk is ten times greater than CPU processing cost.

TABLE 1 Tabu Search Parameters

| Parameter | Value |
|---|---|
| initial state | first local minimum found |
| next state | the best neighbor, not on tabu list |
| local minimum | r-local minimum (20 neighbors) |
| stopping condition | time limit |
| length of tabu list | 20 |

TABLE 2 Iterative Improvement Parameters

| Parameter | Value |
|---|---|
| initial state | random state |
| next state | random neighbor |
| new starting state | random state |
| local minimum | r-local minimum (20 neighbors) |
| stopping condition | both time and local optimum |

5.3. COMBINATORIAL ALGORITHM-SPECIFIC PARAMETERS

There are several implementation-specific features and parameters of the algorithms. These parameters influence the performance of the algorithms, and can be tuned to increase the quality of the outcomes. A summary of these features and algorithm parameters is presented in Tables 1-3.

6. PARAMETERS OF EXPERIMENTS

In the experiments, various parameters of relations and queries were generated to assess the performance of the algorithms in different circumstances. The query size ranges from 10 to 100 joins. Joining relations and their joining attributes are selected at random. The query type is specified with a parameter called tt. It takes on values from 1 up to 10. This range corresponds to various query graphs, from linear trees, through bushy, up to starlike trees.

TABLE 3 Simulated Annealing Parameters

| Parameter | Value |
|---|---|
| initial temperature | max_cost(R) - min_cost(R), where R is a set of 20 random states |
| temperature reduction | exponential according to a time limit |
| initial state | random state |
| next state | random neighbor |
| inner-loop criterion | 1/10 of a time limit |
| system has frozen | always at the end of time limit |

The database profile used in the experiments is characterized by the following parameters. Cardinalities of relations were selected between 100 and 1000 with probability 0.2, between 1000 and 10 000 with probability 0.6, and between 10 000 and 50 000 with probability 0.2. The ballast, or width of relations without joining attributes, was randomly chosen between 50 and 500. All joining attributes have the same width, and each has a randomly chosen domain cardinality. Indexes for relations were chosen at random with probability 0.3. In addition, it was assumed that if there is an index for a relation, then all joining attributes in this relation have indexes. There is a uniform distribution of attribute values and independence of values in the join attributes [15].
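A relation drawn from this profile could be generated roughly as below. The sketch assumes a uniform draw within each cardinality range and within the ballast interval, which the text does not state explicitly; the dictionary keys and the function name are hypothetical, and attribute widths and domain cardinalities are left out.

```python
import random

# (cardinality range, probability) pairs taken from the database profile
CARDINALITY_RANGES = [((100, 1000), 0.2), ((1000, 10000), 0.6), ((10000, 50000), 0.2)]


def random_relation(rng):
    """Draw one relation description from the experimental profile."""
    ranges = [r for r, _ in CARDINALITY_RANGES]
    weights = [w for _, w in CARDINALITY_RANGES]
    low, high = rng.choices(ranges, weights=weights, k=1)[0]
    return {
        "cardinality": rng.randint(low, high),   # assumed uniform within range
        "ballast": rng.randint(50, 500),         # width without joining attributes
        "indexed": rng.random() < 0.3,           # index on all joining attributes
    }


rng = random.Random(0)
relations = [random_relation(rng) for _ in range(5)]
for rel in relations:
    print(rel)
```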

7. RESULTS OF EXPERIMENTS

The algorithms were implemented in C and tested on a Sun-4 ELC workstation. For each setting of the query parameters (N and tt) and of the optimization time limit (time), each algorithm was run 50 times. In order to compare the relative performance of the algorithms, the cost of the obtained solutions was scaled. The scaled cost represents the solution cost over the minimum solution cost found for the query. Several charts are presented in Figures 2-4. Depending on what the x-axis represents, the chart is called C(N), C(time), or C(tt).
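The scaling step can be illustrated with a short sketch. It assumes that the minimum is taken over all runs of all algorithms for the same query; the function name and the sample numbers are made up for illustration and are not experimental results.

```python
def scaled_costs(costs_by_algorithm):
    """Divide every solution cost by the minimum cost found for the query."""
    best = min(min(costs) for costs in costs_by_algorithm.values())
    return {
        name: [c / best for c in costs]
        for name, costs in costs_by_algorithm.items()
    }


# Made-up costs for one query; a scaled cost of 1.0 marks the best plan found.
example = {"TS": [100.0, 120.0], "II": [150.0], "SA": [110.0]}
print(scaled_costs(example))
```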

Figure 2 shows the C(N) charts. For queries requiring more than 30 joins, TS always returns the best results. However, for smaller queries, the ranking of the algorithms strongly depends on the parameters tt and time.

Figure 3 shows the C(time) charts. Between 5 and 15 seconds (N = 100), TS significantly improves its outcomes. It is interesting that for N = 10, TS works very poorly irrespective of optimization time.

Figure 4 shows the C(tt) charts. The query type greatly affects the performance of the algorithms. For N = 100, TS obtains the best results, but for N = 10, the position of TS is much worse for starlike queries than for linear ones. The explanation of such TS behavior for N = 10 may be that, for a relatively small number of joins, the tabu list holds most of the neighborhood. So, the penetrating ability in such a case is greatly limited compared with the other algorithms.

[Figs. 2-4: the C(N), C(time), and C(tt) charts are not recoverable from the source.]

8. CONCLUSIONS

In this study, the relatively new combinatorial optimization technique called Tabu Search was adapted to the large join query optimization problem. For different queries (types and sizes) and various optimization times, it almost always obtains better results than other algorithms such as Iterative Improvement and Simulated Annealing. However, for relatively small queries, it seems to work worse than the others. Future work will deal with applying TS to multiquery optimization and to parallel scheduling of N-way join queries.

REFERENCES

1. W. W. Chu and P. Hurley, Optimal query processing for distributed database systems, IEEE Transactions on Computers C-31(9):835-850 (1982).

2. D. H. Fishman, D. Beech, H. P. Cate, E. C. Chow, T. Connors, J. W. Davis, N. Derrett, C. G. Hoch, W. Kent, P. Lyngbaek, B. Mahbod, M. A. Neimat, T. A. Ryan, and M. C. Shan, IRIS: An object-oriented DBMS, ACM Transactions on Office Information Systems 5(1):48-69 (1987).

3. F. Glover, Tabu Search, CAAI Report 88-3, University of Colorado, Boulder, 1988.

4. F. Glover, E. Taillard, and D. de Werra, A user's guide to Tabu Search, Annals of Operations Research 41(1-4) (1993).

5. T. Ibaraki and T. Kameda, Optimal nesting for computing N-relational joins, ACM Transactions on Database Systems 9(3):482-502 (1984).

6. Y. E. Ioannidis and Y. Kang, Randomized algorithms for optimizing large join queries, in Proc. of ACM-SIGMOD Conference on Management of Data, 1990, pp. 312-321.

7. Y. E. Ioannidis and Y. Kang, Left-deep vs. bushy trees: An analysis of strategy spaces and its implications for query optimization, in Proc. of ACM-SIGMOD Conference on Management of Data, 1991, pp. 168-177.

8. Y. E. Ioannidis and E. Wong, Query optimization by simulated annealing, in Proc. of ACM-SIGMOD Conference on Management of Data, 1987, pp. 9-22.

9. M. Jarke and J. Koch, Query optimization in database systems, ACM Computing Surveys 16(2):111-152 (1984).

10. R. Krishnamurthy, H. Boral, and C. Zaniolo, Optimization of nonrecursive queries, in Proc. of the 12th VLDB Conference, Kyoto, Japan, 1986, pp. 128-137.

11. R. S. G. Lanzelotte and P. Valduriez, Extending the search strategy in a query optimizer, in Proc. of the 17th VLDB Conference, Barcelona, Spain, 1991, pp. 363-373.

12. G. M. Lohman, C. Mohan, L. M. Haas, B. G. Lindsay, P. G. Selinger, P. F. Wilms, and D. Daniels, Query processing in R*, in Query Processing in Database Systems (Kim, Batory, and Reiner, eds.), Springer-Verlag, 1985, pp. 31-47.

13. K. Ono and G. M. Lohman, Measuring the complexity of join enumeration in query optimization, in Proc. of the 16th VLDB Conference, Brisbane, Australia, 1990, pp. 314-325.

14. C. H. Papadimitriou and K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity, Prentice-Hall, 1982.

15. P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price, Access path selection in a relational database management system, in Proc. of ACM-SIGMOD, 1979, pp. 23-34.

16. A. Swami, Optimization of large join queries: Combining heuristics and combinatorial techniques, in Proc. of ACM-SIGMOD Conference on Management of Data, 1989, pp. 367-376.

17. A. Swami and A. Gupta, Optimization of large join queries, in Proc. of ACM-SIGMOD Conference on Management of Data, 1988, pp. 8-17.

18. A. Swami and B. R. Iyer, A polynomial time algorithm for optimizing join queries, in Proc. of the 9th IEEE Conference on Data Engineering, Vienna, Austria, 1993, pp. 345-354.

Received 1 March 199~
