
NORTH-HOLLAND

Efficient Optimization of Large Join Queries Using Tabu Search

MACIEJ MATYSIAK Institute of Computing Science, Poznań University of Technology, ul. Piotrowo 3a, 60-965 Poznań, Poland

ABSTRACT

Finding an optimal query execution plan for join queries is a hard combinatorial optimization problem. To cope with complex large join queries, combinatorial optimization algorithms, such as Simulated Annealing and Iterative Improvement, were proposed as alternatives to traditional enumerative algorithms. In this paper, the relatively new combinatorial optimization technique called Tabu Search is applied. Considering various query sizes, query graph shapes, and optimization times, it is shown that Tabu Search almost always obtains better query execution plans than other combinatorial optimization techniques.

1. INTRODUCTION

A query optimizer in a relational database management system translates a nonprocedural query into a procedural plan for execution by generating many alternative query execution plans (QEPs), estimating the execution cost of each, and choosing the plan with the lowest estimated cost. The computational complexity of the optimization process is determined by the number of alternative QEPs that must be evaluated by the optimizer. In general, the number of alternative plans grows exponentially with the number of relations involved in the query. Traditional query optimizers deal with queries involving only a small number of relations. Therefore, they can use enumerative optimization strategies which consider most of the alternative QEPs [12, 13, 15].

The complexity of the query optimization problem increases when we consider nontraditional applications, such as decision support systems, expert systems, knowledge base systems, object-oriented database systems, and applications from logic programming. These applications tend to pose much more complex queries referring to more relations and requiring processing more joins than traditional applications [2, 10, 13]. For these applications, the enumerative optimization strategies are inadequate, even when combined with heuristics that prune the search space, because they face a combinatorial explosion of alternative QEPs to generate and evaluate.

INFORMATION SCIENCES 83, 77-88 (1995) © Elsevier Science Inc., 1995 655 Avenue of the Americas, New York, NY 10010

0020-0255/95/$9.50 SSDI 0020-0255(94)00094-R

Combinatorial strategies for query optimization were first described in [8]. In [16, 17], Swami and Gupta investigated the use of Simulated Annealing (SA) and Iterative Improvement (II) for optimizing nonrecursive large join queries. Later, Ioannidis and Kang [6, 7] proposed a new hybrid algorithm (called 2PO) that combines II and SA. Recently, Swami and Iyer [18] proposed a new polynomial time algorithm that combines combinatorial searching with the enumerative neighborhood search algorithm proposed earlier by Ibaraki and Kameda [5].

In this paper, the relatively new combinatorial optimization technique called Tabu Search (TS) [3, 4] is applied to the optimization of large join queries. The major contribution of the paper is the comparison of the performance of Tabu Search with the performance of Simulated Annealing and Iterative Improvement. The performance of these techniques is analyzed with respect to different query types (linear, star, and bushy), different query execution plan types (outer linear join processing trees, bushy join processing trees), and different sizes of queries. It is shown that TS almost always produces a better QEP.

The paper is organized as follows. In Section 2, the problem of large join query optimization is formulated as a combinatorial optimization problem. Section 3 describes the Iterative Improvement and Simulated Annealing algorithms. Tabu Search is discussed in Section 4. The application of combinatorial techniques to query optimization is presented in Section 5. Sections 6 and 7 present the parameters and results of the experiments.

2. PROBLEM FORMULATION

The execution cost of a relational query depends on the order in which the relational operators involved in the query are performed. This order represents a query execution plan (QEP). For a given query, there might be many alternative QEPs. The query optimization problem is to find a QEP having a cost as low as possible.

It has been shown [1, 9, 15] that by using the heuristics of performing selections and projections as early as possible and excluding unnecessary Cartesian products, we can eliminate certain suboptimal QEPs from consideration. Thus, considering the join operation as a 2-way join, the optimizer must select the best sequence of 2-way joins to achieve the N-way join of relations requested by the query. Each join in this sequence may have its own join method (nested-loop, sort-merge, hash-based, etc.). Most join methods (e.g., nested-loop) distinguish between the two operands, one being the outer relation and the other being the inner relation.

Thus, the query optimization problem is reduced to finding the order in which relations should be joined, together with the best join method for each 2-way join and the best argument assignment of each join. In this paper, a large join query is a query for which the number of required join operations is equal to or greater than 10.

3. COMBINATORIAL ALGORITHMS

3.1. INTRODUCTION AND TERMINOLOGY

Each solution to a combinatorial optimization problem is considered as a state in a state space. Each state S has a cost associated with it, cost(S), as given by some cost function. The aim of the optimization algorithm is to minimize cost(S) by performing random walks in the state space. A walk consists of a sequence of moves. A move is a transformation applied to a state to get another state. The states that can be reached in one move from a state S are called neighbors of S. A move is called downward (upward) if the cost of the source state is higher (lower) than the cost of the destination state. A state is called a local minimum if its cost is equal to or lower than that of all its neighbors. A state is called a global minimum if it has the lowest cost among all states in the state space. The optimal solution is a global minimum.

3.2. ITERATIVE IMPROVEMENT (II) AND SIMULATED ANNEALING (SA)

II is a very simple metaheuristic procedure, also characterized as a valley-descending technique. It works as follows. The starting state is selected randomly. Then, II walks downward, choosing any of the neighboring states that decrease the cost, until it reaches a local minimum. II repeats this local optimization procedure from new starting states until a stopping condition is satisfied, and then the local minimum with the lowest cost found is returned. Further information about II can be found in [6, 16].
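The descent-and-restart loop described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation used in the study: the function names, the toy permutation problem, and the fixed number of restarts used as the stopping condition are all hypothetical. A state is treated as a local minimum once r randomly sampled neighbors fail to improve it.

```python
import random

def iterative_improvement(random_state, random_neighbor, cost,
                          restarts=10, r=20, rng=None):
    """Random-restart descent; returns the cheapest local minimum found."""
    rng = rng or random.Random()
    best = None
    for _ in range(restarts):              # stopping condition: restart budget (assumption)
        state = random_state(rng)
        failures = 0
        while failures < r:                # r non-improving samples => local minimum
            cand = random_neighbor(state, rng)
            if cost(cand) < cost(state):
                state, failures = cand, 0  # downward move accepted
            else:
                failures += 1
        if best is None or cost(state) < cost(best):
            best = state
    return best

# Hypothetical toy problem: a permutation whose cost is lowest in ascending order.
def toy_cost(perm):
    return sum((len(perm) - i) * v for i, v in enumerate(perm))

def toy_random_state(rng):
    perm = list(range(1, 9))
    rng.shuffle(perm)
    return tuple(perm)

def toy_neighbor(perm, rng):
    i, j = rng.sample(range(len(perm)), 2)  # swap two random positions
    p = list(perm)
    p[i], p[j] = p[j], p[i]
    return tuple(p)

result = iterative_improvement(toy_random_state, toy_neighbor, toy_cost,
                               rng=random.Random(0))
```

Only downward moves are ever accepted here, which is why restarts are needed to escape poor local minima.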

In contrast to II, Simulated Annealing investigates the state space by performing both downward and upward moves. It always accepts downward moves, but upward moves are accepted with a probability which depends on the increase in cost between a neighbor and the current state, and on a parameter called the temperature (T). The higher the temperature or the smaller the cost difference, the more likely it is that an upward move will be accepted. A single optimization consists of several walks through the state space. A walk ends when some inner-loop criterion is satisfied. Then, the temperature is reduced according to some function and another walk begins. The algorithm stops when the system is considered to be frozen, i.e., when the temperature is equal to zero. The formal description of this algorithm can be found in [6, 8, 16].
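A minimal Python sketch of this scheme is given below. The text does not state the acceptance formula, so the sketch assumes the standard Metropolis rule exp(-delta / T); the geometric cooling factor, the fixed inner-loop length, the freezing threshold, and the toy one-dimensional problem are likewise assumptions made for illustration.

```python
import math
import random

def acceptance_probability(delta, t):
    """Downward moves always pass; upward moves pass with exp(-delta / t) (assumed rule)."""
    return 1.0 if delta <= 0 else math.exp(-delta / t)

def simulated_annealing(start, random_neighbor, cost, t0, alpha=0.9,
                        inner_steps=50, t_frozen=1e-3, rng=None):
    rng = rng or random.Random()
    state = best = start
    t = t0
    while t > t_frozen:                   # "frozen" threshold (assumption)
        for _ in range(inner_steps):      # inner-loop criterion: fixed walk length
            cand = random_neighbor(state, rng)
            delta = cost(cand) - cost(state)
            if rng.random() < acceptance_probability(delta, t):
                state = cand
                if cost(state) < cost(best):
                    best = state          # remember the best state seen so far
        t *= alpha                        # geometric temperature reduction
    return best

# Hypothetical toy problem: minimize (x - 7)^2 over the integers.
def toy_cost(x):
    return (x - 7) ** 2

def toy_neighbor(x, rng):
    return x + rng.choice((-1, 1))

best = simulated_annealing(50, toy_neighbor, toy_cost, t0=100.0,
                           rng=random.Random(0))
```

With this rule, a higher temperature or a smaller cost increase both raise the chance that an upward move is accepted, matching the behavior described in the text.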

4. TABU SEARCH

The main idea of TS is to explore the state space while remembering a path of recently visited states. This path (called the tabu list) constitutes a set of restrictions which are used to prevent reversal, or sometimes repetition, of recent moves. It also induces the search to follow a new trajectory if cycling in a narrower sense occurs.

TS works as follows. The starting state is a randomly generated state. To make a move, the iterative procedure generates a sample V* of states from among the set of neighbors N(S) of the current state S. The best state S* in V* is determined and a move from S to S* is made, irrespective of whether it is a downward or an upward move. A number of recently visited states are kept on the tabu list T. It forbids moves which would bring the algorithm back to a previously visited state. Therefore, whenever a set V* of states in N(S) has to be generated at a state S, we check that each candidate for membership of V* is not in T. The iterative procedure stops when some stopping condition is satisfied. It may be, for instance, a lack of improvement of the best solution during kmax consecutive iterations. The general Tabu Search algorithm is presented in Figure 1.

In this study, a modified version of TS is used. Instead of randomly generating an initial state, a descent procedure is used to find a local minimum. Instead of storing a set of visited states in T, a set of moves is kept. This reduces the space and time necessary to check the tabu restrictions. In order to compare TS with other algorithms, a time limit instead of a number of iterations is used to stop the optimization process.
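The move-based variant can be illustrated with the following Python sketch. It is a simplified reading of the description above, not the study's C implementation: the names are hypothetical, the start state is supplied by the caller rather than produced by a descent procedure, an iteration cap stands in for the time limit, and the toy problem uses position swaps as moves (a swap is its own reversal, so keeping it on the tabu list forbids undoing it).

```python
import random
from collections import deque
from itertools import combinations

def tabu_search(start, cost, moves, apply_move, iterations=200,
                sample_size=20, tabu_len=20, rng=None):
    """Move-based tabu search; returns the best state seen."""
    rng = rng or random.Random()
    state = best = start
    tabu = deque(maxlen=tabu_len)          # oldest move drops out automatically
    for _ in range(iterations):            # iteration cap stands in for a time limit
        allowed = [m for m in moves(state) if m not in tabu]
        if not allowed:
            break                          # every neighbor is tabu
        sample = rng.sample(allowed, min(sample_size, len(allowed)))
        move = min(sample, key=lambda m: cost(apply_move(state, m)))
        state = apply_move(state, move)    # taken even when it is an upward move
        tabu.append(move)
        if cost(state) < cost(best):
            best = state
    return best

# Hypothetical toy problem: a permutation whose cost is lowest in ascending order.
def toy_cost(perm):
    return sum((len(perm) - i) * v for i, v in enumerate(perm))

def swap_moves(perm):
    return list(combinations(range(len(perm)), 2))

def apply_swap(perm, move):
    i, j = move
    p = list(perm)
    p[i], p[j] = p[j], p[i]
    return tuple(p)

start = (8, 7, 6, 5, 4, 3, 2, 1)
best = tabu_search(start, toy_cost, swap_moves, apply_swap,
                   rng=random.Random(0))
```

Checking a short list of moves is cheaper than comparing whole states, which is the saving in space and time that the text attributes to storing moves in T.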

5. APPLICATION OF TABU SEARCH

5.1. MOVE SET

The QEP is a single state in a state space and is represented by a string of joins, where each join has its own performing method and two distinguished arguments. A move is a single modification applied to a state to get another state. In order to ensure the possibility of reaching every state in a state space, three types of moves are defined: join method exchange, join method argument exchange, and change of join operator order.

procedure TS()
{
    /* get an initial solution */
    S = initialize();
    minS = S;
    /* set a tabu list */
    T = ∅;
    repeat {
        generate V* ⊆ N(S) - T;
        choose best S* ∈ V*;
        S = S*;
        /* update tabu list T */
        T = (T - {oldest}) ∪ {S};
        if cost(S) < cost(minS) then minS = S;
    }
    until (stopping condition);
    return (minS);
}

Fig. 1. Tabu Search.

The join method exchange is performed as follows. One join operator from the QEP is selected at random. The move consists of changing its join method. The two most popular join methods, nested-loop and sort-merge, are considered. So, the move changes the join method of the selected join operator from nested-loop to sort-merge, or vice versa.

To carry out the join method argument exchange, one join operator from the QEP with the nested-loop method is selected at random. The move consists of exchanging the arguments of the join method, so the outer relation becomes the inner relation, and vice versa.

The change of join operator order is performed as follows. One join operator, say J_i, other than the last one, is selected at random from the QEP. Then, another join, say J_j, is found in the QEP such that the result of J_i is an argument of J_j. The move consists of changing the order of the join operators in such a way that the result of J_j becomes an argument of J_i; it may be either the outer or the inner argument.

Different probabilities are associated with the move types. Since the change of join operator order enables us to investigate the state space faster, probability 0.6 is assigned to this move. The join method exchange and the join method argument exchange each have probability 0.2.
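The weighted choice among move types, together with the two simpler moves, can be sketched in Python as follows. The plan representation (a tuple of `Join` records), the relation names, and the label `J1` for an intermediate result are hypothetical; the change of join operator order is omitted because it depends on how the join tree is stored.

```python
import random
from dataclasses import dataclass, replace

MOVE_TYPES = ("order_change", "method_exchange", "argument_exchange")
MOVE_WEIGHTS = (0.6, 0.2, 0.2)   # probabilities given in the text

@dataclass(frozen=True)
class Join:
    outer: str
    inner: str
    method: str                  # "nested-loop" or "sort-merge"

def pick_move_type(rng):
    """Draw one move type according to the stated probabilities."""
    return rng.choices(MOVE_TYPES, weights=MOVE_WEIGHTS, k=1)[0]

def method_exchange(plan, idx):
    """Toggle the join method of the join at position idx."""
    j = plan[idx]
    new = "sort-merge" if j.method == "nested-loop" else "nested-loop"
    return plan[:idx] + (replace(j, method=new),) + plan[idx + 1:]

def argument_exchange(plan, idx):
    """Swap outer and inner arguments of a nested-loop join."""
    j = plan[idx]
    if j.method != "nested-loop":
        raise ValueError("argument exchange applies to nested-loop joins only")
    return plan[:idx] + (replace(j, outer=j.inner, inner=j.outer),) + plan[idx + 1:]

# Hypothetical two-join plan; "J1" denotes the result of the first join.
plan = (Join("R1", "R2", "nested-loop"), Join("J1", "R3", "sort-merge"))
rng = random.Random(1)
draws = [pick_move_type(rng) for _ in range(10000)]
```

Both moves return a new plan and leave the original untouched, which keeps the current state available if the candidate is rejected.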

5.2. COST FUNCTION

The cost function, corresponding to the execution time of a query, gives a cost for every state, taking into account parameters of relations, join methods, the size of available memory, existence of indexes, etc. There are a few cost formulas which are used, depending on the kind of join method and the existence of indexes. These formulas are based on the following assumptions: (1) there is no pipeline processing, i.e., all base relations are fetched from a disk and all intermediate join results are materialized; (2) sizes of indexes are much smaller than sizes of associated relations; (3) minimal buffering for operations; (4) size of a disk page is 4 KB; and (5) transmission cost from/to a disk is ten times greater than CPU processing cost.

TABLE 1
Tabu Search Parameters

Parameter              Value
initial state          first local minimum found
next state             the best neighbor, not on tabu list
local minimum          r-local minimum (20 neighbors)
stopping condition     time limit
length of tabu list    20

TABLE 2
Iterative Improvement Parameters

Parameter              Value
initial state          random state
next state             random neighbor
new starting state     random state
local minimum          r-local minimum (20 neighbors)
stopping condition     both time and local optimum

5.3. COMBINATORIAL ALGORITHM-SPECIFIC PARAMETERS

There are several implementation-specific features and parameters of the algorithms. These parameters influence the performance of the algorithms, and can be tuned to increase the quality of the outcomes. A summary of these features and algorithm parameters is presented in Tables 1-3.

6. PARAMETERS OF EXPERIMENTS

In this experiment, various parameters of relations and queries were generated to assess the performance of the algorithms in different circumstances. The query size ranges from 10 to 100 joins. Joining relations and their joining attributes are selected at random. The query type is specified with a parameter called tt. It takes on values from 1 up to 10. This range corresponds to various query graphs, from linear trees, through bushy trees, up to starlike trees.

TABLE 3
Simulated Annealing Parameters

Parameter                Value
initial temperature      max_cost(R) - min_cost(R), where R is a set of 20 random states
temperature reduction    exponential according to a time limit
initial state            random state
next state               random neighbor
inner-loop criterion     1/10 of a time limit
system has frozen        always at the end of time limit

The database profile used in the experiments is characterized by the following parameters. Cardinalities of relations were selected between 100 and 1000 with probability 0.2, between 1000 and 10 000 with probability 0.6, and between 10 000 and 50 000 with probability 0.2. The ballast, or width of relations without joining attributes, was randomly chosen between 50 and 500. All joining attributes have the same width, and each has a randomly chosen domain cardinality. Indexes for relations were chosen at random with probability 0.3. In addition, the following assumption was made: if there is an index for a relation, then all joining attributes in this relation have indexes. Attribute values are assumed to be uniformly distributed, and values in the join attributes are assumed to be independent [15].
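A generator for such a profile might look like the following Python sketch. The dictionary layout and names are hypothetical, and drawing the cardinality uniformly within each band is an assumption, since the text only gives the bands and their probabilities. The domain cardinality of joining attributes is left out because no range is stated for it.

```python
import random

# (low, high) cardinality bands and their probabilities, as given in the text.
BANDS = [(100, 1000), (1000, 10000), (10000, 50000)]
BAND_PROBS = [0.2, 0.6, 0.2]

def random_relation(rng):
    """Draw one relation description from the experimental profile."""
    low, high = rng.choices(BANDS, weights=BAND_PROBS, k=1)[0]
    return {
        "cardinality": rng.randint(low, high),  # uniform within the band (assumption)
        "ballast": rng.randint(50, 500),        # width without joining attributes
        "indexed": rng.random() < 0.3,          # index on all joining attributes
    }

rng = random.Random(42)
profile = [random_relation(rng) for _ in range(1000)]
```

Passing an explicit `random.Random` instance keeps generated profiles reproducible across runs, which matters when several algorithms are compared on the same queries.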

7. RESULTS OF EXPERIMENTS

The algorithms were implemented in C and tested on a Sun-4 ELC workstation. For defined parameters of queries (N and tt) and an established optimization time limit (time), each algorithm was run 50 times. In order to compare the relative performance of the algorithms, the cost of the obtained solutions was scaled. The scaled cost represents the solution cost over the minimum solution cost found for the query. Several charts are presented below (Figures 2-4). Depending on what the x-axis represents, the chart is called C(N), C(time), or C(tt).
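The scaling step can be sketched in Python as follows. Averaging the runs of each algorithm before scaling is an assumption, as the text does not say how the 50 runs are aggregated; the function name and the sample costs are made up for illustration.

```python
def scaled_costs(costs_by_algorithm):
    """Mean cost per algorithm divided by the minimum cost found for the query."""
    best = min(min(costs) for costs in costs_by_algorithm.values())
    return {name: (sum(costs) / len(costs)) / best
            for name, costs in costs_by_algorithm.items()}

# Hypothetical costs from two runs of each algorithm on one query.
runs = {"TS": [100.0, 120.0], "II": [150.0, 150.0], "SA": [200.0, 100.0]}
scaled = scaled_costs(runs)
```

A scaled cost of 1.0 means every run of the algorithm matched the best plan found for that query by any algorithm.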

Figure 2 shows the C(N) charts. For queries requiring more than 30 joins, TS always returns the best results. However, for smaller queries, the ranking of the algorithms strongly depends on the parameters tt and time. Figure 3 shows the C(time) charts. Between 5 and 15 seconds (N = 100), TS significantly improves its outcomes. It is interesting that for N = 10, TS works very poorly irrespective of optimization time. Figure 4 shows the C(tt) charts. The query type greatly affects the performance of the algorithms. For N = 100, TS obtains the best results, but for N = 10, the position of TS is much worse for starlike queries than for linear ones. The explanation of such TS behavior for N = 10 may be that for a relatively small number of joins, the tabu list holds most of the neighborhood. So, the penetrating ability in such a case is greatly limited compared with the other algorithms.

[Fig. 2. C(N) charts.]

[Fig. 3. C(time) charts.]

[Fig. 4. C(tt) charts.]

8. CONCLUSIONS

In this study, the relatively new combinatorial optimization technique called Tabu Search was applied to the large join query optimization problem. For different queries (types and sizes) and various optimization times, it almost always obtains better results than other algorithms such as Iterative Improvement and Simulated Annealing. However, for relatively small queries, it seems to perform somewhat worse than the others. Future work will deal with applying TS to multiquery optimization and to parallel scheduling of N-way join queries.

REFERENCES

1. W. W. Chu and P. Hurley, Optimal query processing for distributed database systems, IEEE Transactions on Computers C-31(9):835-850 (1982).

2. D. H. Fishman, D. Beech, H. P. Cate, E. C. Chow, T. Connors, J. W. Davis, N. Derrett, C. G. Hoch, W. Kent, P. Lyngbaek, B. Mahbod, M. A. Neimat, T. A. Ryan, and M. C. Shan, IRIS: An object-oriented DBMS, ACM Transactions on Office Information Systems 5(1):48-69 (1987).

3. F. Glover, Tabu Search, CAAI Report 88-3, University of Colorado, Boulder, 1988.

4. F. Glover, E. Taillard, and D. de Werra, A user's guide to Tabu Search, Annals of Operations Research 41(1-4) (1993).

5. T. Ibaraki and T. Kameda, Optimal nesting for computing N-relational joins, ACM Transactions on Database Systems 9(3):482-502 (1984).

6. Y. E. Ioannidis and Y. Kang, Randomized algorithms for optimizing large join queries, in Proc. of ACM-SIGMOD Conference on Management of Data, 1990, pp. 312-321.

7. Y. E. Ioannidis and Y. Kang, Left-deep vs. bushy trees: An analysis of strategy spaces and its implications for query optimization, in Proc. of ACM-SIGMOD Conference on Management of Data, 1991, pp. 168-177.

8. Y. E. Ioannidis and E. Wong, Query optimization by simulated annealing, in Proc. of ACM-SIGMOD Conference on Management of Data, 1987, pp. 9-22.

9. M. Jarke and J. Koch, Query optimization in database systems, ACM Computing Surveys 16(2):111-152 (1984).

10. R. Krishnamurthy, H. Boral, and C. Zaniolo, Optimization of nonrecursive queries, in Proc. of the 12th VLDB Conference, Kyoto, Japan, 1986, pp. 128-137.

11. R. S. G. Lanzelotte and P. Valduriez, Extending the search strategy in a query optimizer, in Proc. of the 17th VLDB Conference, Barcelona, Spain, 1991, pp. 363-373.

12. G. M. Lohman, C. Mohan, L. M. Haas, B. G. Lindsay, P. G. Selinger, P. F. Wilms, and D. Daniels, Query processing in R*, in Query Processing in Database Systems (Kim, Batory, and Reiner, eds.), Springer-Verlag, 1985, pp. 31-47.

13. K. Ono and G. M. Lohman, Measuring the complexity of join enumeration in query optimization, in Proc. of the 16th VLDB Conference, Brisbane, Australia, 1990, pp. 314-325.

14. C. H. Papadimitriou and K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity, Prentice-Hall, 1982.

15. P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price, Access path selection in a relational database management system, in Proc. of ACM-SIGMOD, 1979, pp. 23-34.

16. A. Swami, Optimization of large join queries: Combining heuristics and combinatorial techniques, in Proc. of ACM-SIGMOD Conference on Management of Data, 1989, pp. 367-376.

17. A. Swami and A. Gupta, Optimization of large join queries, in Proc. of ACM-SIGMOD Conference on Management of Data, 1988, pp. 8-17.

18. A. Swami and B. R. Iyer, A polynomial time algorithm for optimizing join queries, in Proc. of the 9th IEEE Conference on Data Engineering, Vienna, Austria, 1993, pp. 345-354.

Received 1 March 199~