Efficient optimization of large join queries using Tabu Search

MACIEJ MATYSIAK
Institute of Computing Science, Poznań University of Technology, ul. Piotrowo 3a, 60-965 Poznań, Poland

ABSTRACT

Finding an optimal query execution plan for join queries is a hard combinatorial optimization problem. In order to cope with complex large join queries, combinatorial optimization algorithms such as Simulated Annealing and Iterative Improvement have been proposed as alternatives to traditional enumerative algorithms. In this paper, the relatively new combinatorial optimization technique called Tabu Search is applied. Considering various query sizes, query graph shapes, and optimization times, it is shown that Tabu Search almost always obtains better query execution plans than the other combinatorial optimization techniques.

1. INTRODUCTION

A query optimizer in a relational database management system translates a nonprocedural query into a procedural plan for execution by generating many alternative query execution plans (QEPs), estimating the execution cost of each, and choosing the plan with the lowest estimated cost. The computational complexity of the optimization process is determined by the number of alternative QEPs that must be evaluated by the optimizer. In general, the number of alternative plans grows exponentially with the number of relations involved in the query. Traditional query optimizers deal with queries involving only a small number of relations; therefore, they can use enumerative optimization strategies which consider most of the alternative QEPs [12, 13, 15].

The complexity of the query optimization problem increases when we consider nontraditional applications, such as decision support systems, expert systems, knowledge base systems, object-oriented database systems, and applications from logic programming. These applications tend to pose much more complex queries, referring to more relations and requiring the processing of more joins than traditional applications [2, 10, 13]. For these applications, the enumerative optimization strategies are inadequate, even with the above-mentioned heuristics, because they face a combinatorial explosion of alternative QEPs to generate and evaluate.

Combinatorial strategies for query optimization were first described in [8]. In [16, 17], Swami and Gupta investigated the problem of using Simulated Annealing (SA) and Iterative Improvement (II) for optimizing nonrecursive large join queries. Later, Ioannidis and Kang [6, 7] proposed a new hybrid algorithm (called 2PO) that combines II and SA. Recently, Swami and Iyer [18] proposed a new polynomial-time algorithm that combines combinatorial searching with the enumerative neighborhood search algorithm proposed earlier by Ibaraki and Kameda [5].

In this paper, the relatively new combinatorial optimization technique called Tabu Search (TS) [3, 4] is applied to the optimization of large join queries. The major contribution of the paper is a comparison of the performance of Tabu Search with the performance of Simulated Annealing and Iterative Improvement.
The performance of these techniques is analyzed with respect to different query types (linear, star, and bushy), different query execution plan types (outer linear join processing trees, bushy join processing trees), and different query sizes. It is shown that TS almost always produces a better QEP.

The paper is organized as follows. In Section 2, the problem of large join query optimization is formulated as a combinatorial optimization problem. Section 3 describes the Iterative Improvement and Simulated Annealing algorithms. Tabu Search is discussed in Section 4. The application of combinatorial techniques to query optimization is presented in Section 5. Sections 6 and 7 present the parameters and the results of the experiments.

2. PROBLEM FORMULATION

The execution cost of a relational query depends on the order in which the relational operators involved in the query are performed. This order represents a query execution plan (QEP). For a given query, there might be many alternative QEPs. The query optimization problem is to find a QEP with a cost as low as possible.

It has been shown [1, 9, 15] that by using the heuristics of performing selections and projections as early as possible and excluding unnecessary Cartesian products, we can eliminate from consideration certain suboptimal QEPs. Thus, considering the join operation as a 2-way join, the optimizer must select the best sequence of 2-way joins to achieve the N-way join of relations requested by the query. Each join in this sequence may have its own performing method (nested-loop, sort-merge, hash-based, etc.). Most join methods (such as, e.g., nested-loop) distinguish between the two operands, one being the outer relation and the other being the inner relation.

Thus, the query optimization problem is reduced to finding the order in which relations should be joined, together with the best join method for each 2-way join and the best argument assignment for each join. In this paper, a large join query is a query for which the number of required join operations is equal to or greater than 10.

3. COMBINATORIAL ALGORITHMS

3.1. INTRODUCTION AND TERMINOLOGY

Each solution to a combinatorial optimization problem is considered as a state in a state space. Each state S has a cost associated with it, cost(S), as given by some cost function. The aim of the optimization algorithm is to minimize cost(S) by performing random walks in the state space. A walk consists of a sequence of moves. A move is a transformation applied to a state to obtain another state. The states that can be reached in one move from a state S are called the neighbors of S. A move is called downward (upward) if the cost of the source state is higher (lower) than the cost of the destination state. A state is called a local minimum if its cost is equal to or lower than that of all its neighbors. A state is called a global minimum if it has the lowest cost among all states in the state space. The optimal solution is a global minimum.

3.2. ITERATIVE IMPROVEMENT (II) AND SIMULATED ANNEALING (SA)

II is a very simple metaheuristic procedure, also characterized as a valley-descending technique. It works as follows. The starting state is selected randomly. Then, II walks downward, choosing any of the neighboring states that decrease the cost, until it reaches a local minimum. II repeats this local optimization procedure until a stopping condition is satisfied, and then the local minimum with the lowest cost found is returned. Further information about II can be found in [6, 16].
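For illustration only, a minimal Python sketch of the II procedure described above might look as follows. The random_state, neighbors, and cost callables are placeholders for an actual optimizer's state interface, and the fixed number of restarts stands in for the time-based stopping condition used later in the paper (cf. Table 2); none of this reproduces the implementation used in the experiments.

    import random

    def iterative_improvement(random_state, neighbors, cost, restarts=10):
        """Illustrative sketch of Iterative Improvement (II)."""
        best = None
        for _ in range(restarts):                  # repeat the local optimization
            state = random_state()                 # random starting state
            improved = True
            while improved:                        # walk downward to a local minimum
                improved = False
                for n in neighbors(state):
                    if cost(n) < cost(state):      # accept any cost-decreasing neighbor
                        state, improved = n, True
                        break
            if best is None or cost(state) < cost(best):
                best = state                       # keep the cheapest local minimum found
        return best

    if __name__ == "__main__":
        # toy state space: integers 0..99 with several local minima
        toy_cost = lambda s: (s % 7) + abs(s - 50) / 10.0
        print(iterative_improvement(
            random_state=lambda: random.randrange(100),
            neighbors=lambda s: [max(s - 1, 0), min(s + 1, 99)],
            cost=toy_cost))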
In contrast to II, Simulated Annealing explores the state space by performing both downward and upward moves. It always accepts downward moves, but upward moves are accepted with a probability which depends on the increase in cost between the neighbor and the current state, and on a parameter called the temperature (T). The higher the temperature or the smaller the cost difference, the more likely it is that an upward move will be accepted. A single optimization run consists of several walks through the state space. A walk ends when some inner-loop criterion is satisfied. Then, the temperature is reduced according to some function and another walk begins. The algorithm stops when the system is considered to be frozen, i.e., when the temperature is equal to zero. A formal description of this algorithm can be found in [6, 8, 16].
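The SA acceptance rule can likewise be sketched in a few lines of Python. The temperature schedule, inner-loop length, and freezing threshold below are arbitrary illustrative values introduced here and do not reproduce the settings listed later in Table 3.

    import math
    import random

    def simulated_annealing(start, random_neighbor, cost,
                            t0=100.0, alpha=0.9, inner_steps=50, t_min=1e-3):
        """Illustrative sketch of Simulated Annealing (SA)."""
        state, best, t = start, start, t0
        while t > t_min:                                 # until the system is "frozen"
            for _ in range(inner_steps):                 # one walk (inner loop)
                candidate = random_neighbor(state)
                delta = cost(candidate) - cost(state)
                # accept downward moves always, upward moves with probability exp(-delta/T)
                if delta <= 0 or random.random() < math.exp(-delta / t):
                    state = candidate
                if cost(state) < cost(best):
                    best = state
            t *= alpha                                   # reduce the temperature
        return best

With a high temperature almost every upward move is accepted, while as the temperature approaches zero the procedure degenerates into pure descent; this is the behavior the description above relies on.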
4. TABU SEARCH

The main idea of TS is to explore the state space while remembering a path of recently visited states. This path (called the tabu list) constitutes a set of restrictions which are used to prevent the reversal, and sometimes the repetition, of recent moves. It also induces the search to follow a new trajectory if cycling, in a narrower sense, occurs.

TS works as follows. The starting state is a randomly generated state. To make a move, the iterative procedure generates a sample V* of states from among the set of neighbors N(S) of the current state S. The best state S* in V* is determined and a move from S to S* is made, irrespective of whether it is a downward or an upward move. A number of recently visited states is kept on the tabu list T. It forbids moves which would bring the algorithm back to a previously visited state. Therefore, whenever at a state S a set V* of states in N(S) has to be generated, we check that each candidate for membership in V* is not in T. The iterative procedure stops when some stopping condition is satisfied; it may be, for instance, a lack of improvement of the best solution during kmax iterations. The general Tabu Search algorithm is presented in Figure 1.

In this study, a modified version of TS is used. Instead of randomly generating an initial state, a descent procedure is used to find a local minimum. Instead of storing in T a set of visited states, a set of moves is kept; this reduces the space and time necessary to check the tabu restrictions. In order to compare TS with the other algorithms, a time limit, rather than a number of iterations, is used to stop the optimization process.

    procedure TS() {
        /* Get an initial solution */
        S = initialize();
        minS = S;
        /* set a tabu list */
        T = ∅;
        repeat {
            generate V* ⊆ N(S) - T;
            choose best S* ∈ V*;
            S = S*;
            /* update tabu list T */
            T = (T - {oldest}) ∪ {S};
            if cost(S) < cost(minS) then minS = S;
        } until (stopping condition);
        return (minS);
    }

    Fig. 1. Tabu Search.

5. APPLICATION OF TABU SEARCH

5.1. MOVE SET

A QEP is a single state in the state space and is represented by a string of joins, where each join has its own performing method and two distinguished arguments. A move is a single modification applied to a state to obtain another state. In order to ensure the possibility of reaching every state in the state space, three types of moves are defined: join method exchange, join method argument exchange, and change of join operator order.

The join method exchange is performed as follows. One join operator from the QEP is selected at random. The move consists of changing its join method. The two most popular join methods, nested-loop and sort-merge, are considered; so the move changes the join method of the operator from nested-loop to sort-merge, or vice versa. In order to carry out a join method argument exchange, one join operator with the nested-loop method is selected at random from the QEP. The move consists of exchanging the arguments of the join method, so that the outer relation becomes the inner relation, or vice versa. The change of join operator order is performed as follows. Select at random one join operator, say Ji, from the QEP, but not the last one. Then, find another join, say Jj, in the QEP such that the result of Ji is an argument of Jj. The move consists of changing the order of the join operators in such a way that the result of Jj becomes an argument of Ji; it may be either the outer or the inner argument.

Different probabilities are associated with the move types. Since the join order exchange enables the search to explore the state space faster, a probability of 0.6 is assigned to this move. The change of the join method and the change of join arguments each have probability 0.2.

5.2. COST FUNCTION

The cost function, corresponding to the execution time of a query, gives a cost for every state, taking into account the parameters of the relations, the join methods, the size of available memory, the existence of indexes, etc. Several cost formulas are used, depending on the kind of join method and the existence of indexes. These formulas are based on the following assumptions: (1) there is no pipelined processing, i.e., all base relations are fetched from disk and all intermediate join results are materialized; (2) the sizes of indexes are much smaller than the sizes of the associated relations; (3) only minimal buffering is available for operations; (4) the size of a disk page is 4 KB; and (5) the cost of transmission from/to disk is ten times greater than the CPU processing cost.

5.3. COMBINATORIAL ALGORITHM-SPECIFIC PARAMETERS

There are several implementation-specific features and parameters of the algorithms. These parameters influence the performance of the algorithms and can be tuned to increase the quality of the outcomes. A summary of these features and algorithm parameters is presented in Tables 1-3.

TABLE 1
Tabu Search Parameters

    Parameter             Value
    initial state         first local minimum found
    next state            the best neighbor not on the tabu list
    local minimum         r-local minimum (20 neighbors)
    stopping condition    time limit
    length of tabu list   20

TABLE 2
Iterative Improvement Parameters

    Parameter             Value
    initial state         random state
    next state            random neighbor
    new starting state    random state
    local minimum         r-local minimum (20 neighbors)
    stopping condition    both time and local optimum
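To make the move set of Section 5.1 more concrete, the following purely illustrative Python sketch restricts itself to left-deep (outer linear) processing trees, in which a plan reduces to a permutation of base relations plus per-join annotations; under that simplification, the change of join operator order becomes a swap of adjacent relations. The Plan class, its field names, and this simplification are assumptions introduced here for illustration and are not the representation used in the paper.

    import random

    class Plan:
        """Illustrative left-deep QEP: a join order plus per-join annotations."""
        def __init__(self, relations):
            self.relations = list(relations)                  # join order (left-deep)
            self.methods = ['NL'] * (len(relations) - 1)      # 'NL' = nested-loop, 'SM' = sort-merge
            self.outer_first = [True] * (len(relations) - 1)  # argument assignment of each join

    def random_move(plan):
        """Apply one randomly chosen move type with probabilities 0.6 / 0.2 / 0.2."""
        r = random.random()
        n_joins = len(plan.methods)
        if r < 0.6:
            # change of join operator order: swap two adjacent relations
            i = random.randrange(len(plan.relations) - 1)
            plan.relations[i], plan.relations[i + 1] = plan.relations[i + 1], plan.relations[i]
        elif r < 0.8:
            # join method exchange: flip nested-loop <-> sort-merge for one join
            j = random.randrange(n_joins)
            plan.methods[j] = 'SM' if plan.methods[j] == 'NL' else 'NL'
        else:
            # join method argument exchange: only nested-loop joins are eligible
            nl = [j for j in range(n_joins) if plan.methods[j] == 'NL']
            if nl:
                j = random.choice(nl)
                plan.outer_first[j] = not plan.outer_first[j]
        return plan

    if __name__ == "__main__":
        p = Plan(["R1", "R2", "R3", "R4"])
        print(vars(random_move(p)))   # one randomly perturbed plan

A Tabu Search over such plans would then record recently applied moves, rather than whole plans, in the tabu list, as described in Section 4.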
6. PARAMETERS OF EXPERIMENTS

In the experiments, various parameters of relations and queries were generated in order to examine the performance of the algorithms under different circumstances. The query size ranges from 10 to 100 joins. The joined relations and their joining attributes are selected at random. The query type is specified by a parameter called ft, which takes values from 1 up to 10. This range corresponds to various query graph shapes, from linear trees, through bushy trees, up to star-like trees.

TABLE 3
Simulated Annealing Parameters

    Parameter               Value
    initial temperature     max_cost(R) - min_cost(R), where R is a set of 20 random states
    temperature reduction   exponential, according to a time limit
    initial state           random state
    next state              random neighbor
    inner-loop criterion    1/10 of the time limit
    system has frozen       always at the end of the time limit

The database profile used in the experiments is characterized by the following parameters. Cardinalities of relations were selected between 100 and 1000 with probability 0.2, between 1000 and 10,000 with probability 0.6, and between 10,000 and 50,000 with probability 0.2. The ballast, i.e., the width of a relation without its joining attributes, was randomly chosen between 50 and 500. All joining attributes have the same width, and each has a randomly chosen domain cardinality. Indexes for relations were chosen at random with probability 0.3. In addition, the following assumption was made: if there is an index for a relation, then all joining attributes in this relation have indexes. A uniform distribution of attribute values and independence of values in the join attributes are assumed [15].

7. RESULTS OF EXPERIMENTS

The algorithms were implemented in...