R. Cicchetti et al. (Eds.): DEXA 2002, LNCS 2453, pp. 351–360, 2002. Springer-Verlag Berlin Heidelberg 2002
Estimating Costs of Path Expression Evaluation in Distributed Object Databases
Gabriela Ruberg, Fernanda Baião, and Marta Mattoso
Department of Computer Science – COPPE/UFRJ P.O.Box 68511, Rio de Janeiro, RJ, 21945-970 – Brazil {gruberg, baiao, marta}@cos.ufrj.br
Abstract. Efficient evaluation of path expressions in distributed object databases involves choosing among several query processing strategies, due to the rich semantics involved in object-based data models and to the complexity added by the distribution. This work presents a new cost model for object-based query processors and addresses relevant issues, which are ignored or relaxed in other works in the literature, such as the selectivity of the path expression, the sharing degree of the referenced objects, the partial participation of the collections in the relationships, and the distribution of the database objects across the nodes of a network. These issues allowed us to present more realistic estimates for the query optimizer. Our cost model has been validated against experimental results obtained with an object DBMS prototype running in a distributed architecture, using the OO7 benchmark application.
1 Introduction
The development of realistic and efficient query optimizers is extremely important in enhancing the performance of database systems. In current query languages, path expression processing optimization is a central and difficult issue. Reference attributes in path expressions provide direct (pointer) access through object navigation, such as in object databases, or element navigation in XML [6]. The choice of the best execution plan to process a query with reference attributes is not simple for the query optimizer to make, due to the large number of execution strategies and algorithms to evaluate a path expression. Relevant issues must be considered for this problem, including choosing from binary or n-ary operators, pointer- or value-based algorithms, forward or reverse evaluation directions. These issues are not fully addressed in current cost models for object-based query optimizers, compromising the accuracy of estimate models.
In a distributed environment, the query execution search space is even larger because of fragmented data. Distributed data processing is becoming popular due to performance gains obtained from PC clusters, grid computing, and the Web, among others [11]. However, current path expression optimizers lack practical cost functions for ad-hoc queries in fragmented collections of objects. These functions may not be directly obtained from centralized cost models because some fragmented data may be previously disregarded during the query execution, modifying substantially the query
costs. Even in a centralized context, a realistic cost model cannot be obtained by simply combining the relevant issues into a single model, because these issues are strongly interrelated and have to be remodeled. Table 1 identifies which important issues are covered by current object database cost models. Next, we discuss the impact of the issues in Table 1 on the cost estimates.
Estimating the selectivity factor is essential to the performance analysis of query processing. The selectivity of path expressions can vary significantly according to the partial or total participation of a class in a relationship. Partial participation means that only a subset of the objects in a class is related to the objects of another class. However, most cost models [1, 2, 7, 8, 9, 10, 13] disregard partial participation. Cho et al. [5] present a realistic method for the estimation of selectivity factors, but only in centralized object databases.
A path expression can be evaluated in a forward direction (from the first to the last collection) or in a reverse direction (in the opposite way). Many cost models [1, 2, 7, 8, 9] are limited to the forward direction. Costs for the reverse direction cannot be obtained by simply reversing the index range; additional parameters have to be introduced. The two basic algebra operators for path expression evaluation are the n-ary operator and the binary operator. The execution costs of a path expression may vary significantly for each pair (evaluation direction, algebra operator), according to the selectivity of the nested predicates and to the partial participation of the collections in the relationships of the path expression. A cost model restricted to a specific direction or execution strategy may prevent the query optimizer from choosing the best execution plan.
The number of IO operations, estimated in terms of data pages, is often presented as the basic cost in query processing [1, 2, 5, 8, 9, 10, 13], especially in centralized execution. The object data model allows complex strategies due to the rich variety of constructors provided, and object clustering techniques, when applied, can drastically affect the cost estimates of IO operations. This aspect has been disregarded by most cost models [1, 2, 5, 7, 8, 10, 13]. Very few works analyze CPU costs [9, 10]. Communication costs of distributed evaluation of path expressions in vertically and/or horizontally fragmented classes are not addressed in the literature. Almost all processing cost factors are strongly influenced by the size of available main memory, although this factor is usually not taken into account. Under the small memory hypothesis, the IO reload overhead of a path expression evaluation is traditionally estimated using the collection fan-out parameter [8, 9, 10]. However, we have noticed that practically no additional IO operations are necessary if there is no object sharing in the relationships of the path expression, even if the fan-out is greater than one.
Works on distributed object-based cost functions are dedicated to algorithms for class partitioning in object databases. They focus on the analysis of primary horizontal (P.H.F.) [1, 2, 7], derived horizontal (D.H.F.) [2], and vertical (V.F.) [8] fragmentation methodologies. Their application in a real query optimizer is somewhat restricted, since they disregard important issues in path expression evaluation, such as the object clustering policy, the evaluation direction, the binary operator and its algorithms, and CPU costs. These issues cannot be trivially included in the cost models of those algorithms.
Table 1. Important issues in path expression processing and related cost models

Issues / Cost models: [1] [2] [3] [5] [7] [8] [9] [10] [13] [14]

Centralized
Partial participation: X X X
Physical object clustering: X X
Evaluation direction: X X X X X
N-ary operator: X X X X X X X X
Binary operator: X X X X
IO overhead due to object sharing: X
IO costs: X X X X X X X X
CPU costs: X X X

Distributed
P.H.F.: X X X X
D.H.F.: X X
V.F.: X X
Communication costs: X
We present a new cost model that covers the most representative algorithms for binary and n-ary operators, as well as forward and reverse directions for general path expression evaluation, in both centralized and distributed environments. An extended version of this work with detailed cost formulas may be found in [14]. In addition, our cost model has been validated against experimental results obtained in our previous work [15].
The remainder of this paper is organized as follows. Section 2 describes our cost model with emphasis on the estimation of selectivity factors. Section 3 shows the validation of our cost model against experimental results obtained with an object DBMS prototype using the OO7 benchmark. Finally, Section 4 presents conclusions and future work.
2 Cost Model
Invariably, the complexity of optimization problems requires some simplifications in the cost model. We assume that: (i) the query optimizer is able to break the encapsulation property; (ii) objects have a size less than a database page; (iii) the attribute values are uniformly distributed among instances of a class; and (iv) each object collection has just one class as its domain. These assumptions are present in other cost models [1, 2, 3, 5, 8, 9, 10, 13] since they occur in most object-based DBMS, as well as in their typical applications. Thus, they do not limit the expressive power of our cost model.
In our approach for estimating the cost of query execution plans, we consider that queries are issued against collections, thus some statistics are maintained for collections rather than for classes. The parameters of the fragments $F_j^i$ are represented similarly to the parameters of the collections, adding the index $j$, $1 \le j \le f_i$. Therefore, ${SEL'}_j^i$ represents the selectivity of the path expression over the fragment $F_j^i$, while $D_j^{i,i+1}$ is the total number of distinct pointers from $C_i$ objects to $F_j^{i+1}$ objects.
Table 2. Cost Model Parameters

| Param | Description |
|---|---|
| $SEL_i$ | Selectivity of nested predicate $p_i$ over $C_i$ |
| ${SEL'}_i$ | Selectivity of the path expression over $C_i$ |
| $\|C_i\|$ | Cardinality of $C_i$ |
| $|C_i|$ | # pages of $C_i$ |
| $CS_i$ | Average size of one object of $C_i$ |
| $f_i$ | # fragments of $C_i$ |
| $Z_{i-1,i}$ | Average # distinct pointers to $C_{i+1}$ objects from $C_i$ objects that have at least one non-null reference |
| $D_{i,i+1}$ | Total # distinct pointers from $C_i$ objects to $C_{i+1}$ objects |
| $X_{i,i+1}$ | # $C_i$ objects having all pointers to $C_{i+1}$ objects as null references |
| $sel_j^i$ | Selectivity factor over the $C_i$ cardinality according to the $F_j^i$ cardinality |
| $REF_i$ | # distinct accessed objects from $C_i$ in the path evaluation |
| $ref_j^i$ | Analogous to $REF_i$, in $F_j^i$ |
| $\ell$ | Length of the path expression |

2.1 Selectivity Factor of Path Expressions

The basis for evaluating query optimization strategies is the estimation of the selectivity factor of selection predicates and joins [5]. The selectivity factor of a path expression is the selectivity factor resultant from the nested predicates and the participation of each class collection in path relationships. Partial participation of a class collection influences the prediction of the path expression selectivity due not only to the estimation of the selectivity of implicit joins, but also due to the estimation of selectivity of nested predicates. Therefore, only the referenced objects in the path expression must be taken into account to estimate the selectivity factors over the collections.

When pointer-based algorithms are used, the path expression selectivity over a collection $C_i$ represents the portion of the $C_i$ objects that will be accessed during the path evaluation. Moreover, the path expression selectivity determines the cardinality of the intermediate results generated by join algorithms (pointer and value-based). Its computation over each collection $C_i$, $1 \le i \le \ell$ (where $\ell$ is the length of the path expression), also depends on the direction used to evaluate the path expression. Thus, given a path expression, we may express the number of distinct accessed objects in collection $C_i$ during the path navigation as:

$$REF_i = {SEL'}_i \times \|C_i\| . \qquad (1)$$

The term ${SEL'}_i$, $1 \le i \le \ell$, is obtained according to the evaluation direction:

• In forward, ${SEL'}_1 = 1$ and

$${SEL'}_i = {SEL'}_{i-1} \times SEL_{i-1} \times \frac{D_{i-1,i}}{\|C_i\|} ; \qquad (2)$$

• In reverse, ${SEL'}_\ell = 1$ and

$${SEL'}_i = {SEL'}_{i+1} \times SEL_{i+1} \times \frac{\|C_i\| - X_{i,i+1}}{\|C_i\|} . \qquad (3)$$

Note that all objects in the starting collection ($C_1$ or $C_\ell$, according to the evaluation direction) are accessed because there is no filter from a previous relationship in the path expression.
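Equations (1) to (3) can be traced with a short Python sketch. The function names, the 0-based list layout, and the sample statistics are illustrative assumptions made here; they are not part of the cost model itself.

```python
from typing import List


def forward_path_selectivity(card: List[float], sel: List[float],
                             d: List[float]) -> List[float]:
    """Equation (2): SEL'_1 = 1, SEL'_i = SEL'_{i-1} * SEL_{i-1} * D_{i-1,i} / ||C_i||.

    Lists are 0-based (an assumption of this sketch): card[k] is ||C_{k+1}||,
    sel[k] is SEL_{k+1}, and d[k] is D_{k+1,k+2}.
    """
    sel_path = [1.0]  # every object of the starting collection C_1 is accessed
    for k in range(1, len(card)):
        sel_path.append(sel_path[k - 1] * sel[k - 1] * d[k - 1] / card[k])
    return sel_path


def reverse_path_selectivity(card: List[float], sel: List[float],
                             x: List[float]) -> List[float]:
    """Equation (3): SEL' of the last collection is 1, then
    SEL'_i = SEL'_{i+1} * SEL_{i+1} * (||C_i|| - X_{i,i+1}) / ||C_i||.

    x[k] is X_{k+1,k+2}: objects of C_{k+1} whose pointers are all null.
    """
    n = len(card)
    sel_path = [1.0] * n  # the last collection is the starting one in reverse
    for k in range(n - 2, -1, -1):
        sel_path[k] = sel_path[k + 1] * sel[k + 1] * (card[k] - x[k]) / card[k]
    return sel_path


def referenced_objects(card: List[float], sel_path: List[float]) -> List[float]:
    """Equation (1): REF_i = SEL'_i * ||C_i||."""
    return [s * c for s, c in zip(sel_path, card)]


# Hypothetical two-collection path; all statistics are made up.
card = [1000, 4000]   # ||C_1||, ||C_2||
sel = [0.5, 1.0]      # SEL_1, SEL_2
d = [3000]            # D_{1,2}
x = [100]             # X_{1,2}

forward_ref = referenced_objects(card, forward_path_selectivity(card, sel, d))
reverse_ref = referenced_objects(card, reverse_path_selectivity(card, sel, x))
```

With these made-up numbers, the forward evaluation touches all 1000 objects of the first collection and an estimated 1500 of the second, while the reverse evaluation touches all 4000 objects of the second collection and an estimated 900 of the first.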
In path expressions involving large collections with low selectivity factors, the traditional probabilistic method for selectivity estimation [1, 2, 3, 5, 8, 10] results in a significant deviation from real values, as shown in Section 3. This deviation, which is avoided in our method, may be propagated to the estimation of page hits and to all costs that are based on the selectivity factor (IO, CPU and communication costs). Additionally, our method has low computational complexity, which reduces the processing cost of the optimization task itself.
Fragmentation Effects. Horizontal fragmentation distributes class instances among fragments (object collections) with the same structure, according to a given fragmentation criterion. Analogously, vertical fragmentation splits the logical structure of a class and distributes its attributes (and methods) among fragments with the same cardinality. Let $C_i$, $1 \le i \le \ell$, be a collection of a path expression with primary horizontal or vertical fragmentation. During the evaluation of this path expression, the query processor can identify in advance:

i) a horizontal fragment $F_j^i$, $1 \le j \le f_i$, where the selectivity of the associated nested predicate $p_i$ is zero; or
ii) a vertical fragment $F_j^i$, $1 \le j \le f_i$, whose attributes are not used in the query.

In both cases, we assume $SEL_j^i = 0$, thus causing the elimination of $F_j^i$ during the query processing. If $C_i$ is fragmented, only the set of fragments of $C_i$ in which $SEL_j^i \ne 0$ will be scanned during the query evaluation process. We may define the subset $Elim_i$ containing all $C_i$ fragments eliminated by $SEL_j^i$ as:

$$Elim_i = \{\, F_j^i \mid 1 \le j \le f_i,\ SEL_j^i = 0 \,\} . \qquad (4)$$
In addition, we may define the subset $Elim'_i$ that refers to the derived horizontal fragments from $C_i$ which were indirectly eliminated by the path expression selectivity (if their primary fragments were eliminated too), as follows:

$$Elim'_i = \{\, F_j^i \mid 1 \le j \le f_i,\ (F_j^{i-1} \rightarrow F_j^i) \wedge (F_j^{i-1} \in E_{i-1}) \,\} . \qquad (5)$$

The term $F_j^{i-1} \rightarrow F_j^i$ denotes that the primary fragment $F_j^{i-1}$ determines the derived fragment $F_j^i$, in the forward evaluation. The reverse evaluation formula is obtained analogously to (5). We may define the set $E_i$, $1 \le i \le \ell$, with cardinality $\#E_i$, of all $C_i$ fragments that will not be scanned during the path expression evaluation as:

$$E_i = Elim_i \cup Elim'_i . \qquad (6)$$
We estimate the selectivity factor of $C_i$ objects that belong to $E_i$ as:

$$selE_i = \sum_{F_j^i \in E_i} sel_j^i . \qquad (7)$$
The formal definition of set Ei and of its subsets, representing the fragmented data that is disregarded during the query evaluation, allows us to properly estimate the selectivity factors and execution costs of a distributed path expression evaluation.
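The sets defined in equations (4) to (7) translate directly into set operations. The sketch below is only an illustration: fragments are identified by 0-based position, a derived fragment is assumed to be determined by the primary fragment with the same index (as the shared index $j$ in equation (5) suggests), and all numbers are made up.

```python
from typing import List, Set


def eliminated_by_predicate(frag_sel: List[float]) -> Set[int]:
    """Equation (4): fragments of C_i whose nested predicate selectivity is zero."""
    return {j for j, s in enumerate(frag_sel) if s == 0}


def eliminated_by_derivation(num_frags: int, prev_not_scanned: Set[int]) -> Set[int]:
    """Equation (5), forward direction: derived fragment j is dropped when the
    primary fragment j of the previous collection belongs to E_{i-1}."""
    return {j for j in range(num_frags) if j in prev_not_scanned}


def not_scanned(elim: Set[int], elim_derived: Set[int]) -> Set[int]:
    """Equation (6): E_i is the union of Elim_i and Elim'_i."""
    return elim | elim_derived


def eliminated_share(frag_share: List[float], e: Set[int]) -> float:
    """Equation (7): selE_i is the sum of sel_j^i over the fragments in E_i."""
    return sum(frag_share[j] for j in sorted(e))


# Hypothetical collection C_i with three derived horizontal fragments.
frag_sel = [0.4, 0.0, 0.7]     # SEL_j^i of the nested predicate per fragment
frag_share = [0.2, 0.3, 0.5]   # sel_j^i, the share of ||C_i|| in each fragment
prev_e = {0}                   # E_{i-1}: primary fragment 0 was eliminated

elim = eliminated_by_predicate(frag_sel)
elim_derived = eliminated_by_derivation(len(frag_sel), prev_e)
e_i = not_scanned(elim, elim_derived)
sel_e = eliminated_share(frag_share, e_i)
```

In this example only the third fragment is scanned: the second is dropped by its own predicate and the first because its primary fragment was already eliminated, so half of the $C_i$ objects are disregarded.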
Path Expression Selectivity in Horizontal Fragmentation. The number of distinct objects retrieved from a horizontally fragmented collection $C_i$, $1 \le i \le \ell$, during the evaluation of a path expression is given by:

$$REF_i = \sum_{j=1}^{f_i} ref_j^i , \qquad (8)$$

where

$$ref_j^i = {SEL'}_j^i \times \|F_j^i\| , \quad 1 \le j \le f_i . \qquad (9)$$
If $F_j^i \in E_i$ then we have $ref_j^i = 0$. Otherwise, $F_j^i \notin E_i$ and ${SEL'}_j^i$ is calculated according to both the horizontal fragmentation strategy of $C_i$ (primary or derived) and the path expression evaluation direction. In a forward evaluation¹, with $1 < i \le \ell$ and $1 \le j \le f_i$, we have:

• In P.H.F., ${SEL'}_j^1 = 1$ and

$${SEL'}_j^i = {SEL'}_{i-1} \times SEL_{i-1} \times \frac{D_j^{i-1,i}}{\|F_j^i\|} ; \qquad (10)$$

• In D.H.F., ${SEL'}_j^1 = 1$ and

$${SEL'}_j^i = part_j^i\left({SEL'}_{i-1} \times SEL_{i-1}\right) \times \frac{D_j^{i-1,i}}{\|F_j^i\|} . \qquad (11)$$
In equation (11), the function $part_j^i(factor)$ returns the participation of the fragment $F_j^i$ in the objects selected by $factor$ from $C_i$. Modeling this participation is important because if derived horizontal fragmentation is applied on $C_i$ and some of its fragments are eliminated by their $C_{i-1}$ primary fragments, then only non-eliminated $C_i$ fragments contribute to $REF_i$ objects. Indeed, the selectivity term $({SEL'}_{i-1} \times SEL_{i-1})$ is not proportionally distributed among all $C_i$ fragments, but restricted to non-eliminated $C_i$ fragments. Therefore:

$$part_j^i(factor) = \begin{cases} 1, & \text{if } factor = 1, \\ factor + \dfrac{selElim'_i \times factor}{1 - selElim'_i}, & \text{otherwise.} \end{cases} \qquad (12)$$
The selectivity factor $selElim'_i$ of objects from $C_i$ fragments that were eliminated by the path expression selectivity is analogous to formula (7).

¹ Estimation in reverse evaluation is obtained analogously to (3), applying the function $part(factor)$ if $C_i$ has derived fragmentation.

Finally, the path expression selectivity and the nested predicate selectivity over a collection $C_i$ that is horizontally fragmented are given respectively by:
• ${SEL'}_S = 1$ and

$${SEL'}_i = \sum_{j=1}^{f_i} \left( sel_j^i \times {SEL'}_j^i \right) ; \qquad (13)$$

• $$SEL_i = \sum_{j=1}^{f_i} \left( sel_j^i \times SEL_j^i \right) . \qquad (14)$$

The term ${SEL'}_S$ represents the selectivity factor of the path expression in the starting collection. Note that partial participation of collections in path relationships influences the estimation of the selectivity factors of each fragment involved in the path expression evaluation. In a distributed context, if total participation is assumed, then the difference between real values and estimates is even larger, due to the accumulation of many fragment deviations.
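The per-fragment estimates of equations (8) to (13) can be sketched in Python as follows. Several points are assumptions of this sketch rather than statements of the paper: equation (12) is read as returning 1 when the factor is 1 and factor + factor × selElim'_i / (1 − selElim'_i) otherwise; the share $sel_j^i$ is taken to be $\|F_j^i\| / \|C_i\|$; and the function names and sample statistics are invented for illustration.

```python
from typing import List, Tuple


def part(factor: float, sel_elim_derived: float) -> float:
    """One reading of equation (12): the selectivity that would have fallen on
    eliminated derived fragments is redistributed over the remaining ones."""
    if factor == 1:
        return 1.0
    return factor + factor * sel_elim_derived / (1.0 - sel_elim_derived)


def fragment_path_selectivity(prev_sel_path: float, prev_sel: float,
                              d_frag: float, frag_card: float,
                              derived: bool = False,
                              sel_elim_derived: float = 0.0) -> float:
    """Equations (10) and (11), forward direction, for a non-eliminated fragment.

    prev_sel_path is SEL'_{i-1}, prev_sel is SEL_{i-1}, d_frag is D_j^{i-1,i},
    and frag_card is ||F_j^i||.
    """
    factor = prev_sel_path * prev_sel
    if derived:
        factor = part(factor, sel_elim_derived)
    return factor * d_frag / frag_card


def collection_totals(frag_sel_path: List[float],
                      frag_card: List[float]) -> Tuple[float, float]:
    """Equations (8), (9) and (13) for one horizontally fragmented collection.

    Eliminated fragments should be passed with a path selectivity of 0.
    """
    refs = [s * c for s, c in zip(frag_sel_path, frag_card)]   # equation (9)
    ref_total = sum(refs)                                      # equation (8)
    total_card = sum(frag_card)   # assumes the fragments partition C_i
    sel_path = sum((c / total_card) * s                        # equation (13)
                   for s, c in zip(frag_sel_path, frag_card))
    return ref_total, sel_path


# Hypothetical collection with two primary horizontal fragments.
frag_card = [1000, 3000]   # ||F_1^i||, ||F_2^i||
d_frag = [600, 900]        # D_1^{i-1,i}, D_2^{i-1,i}
frag_sel_path = [fragment_path_selectivity(1.0, 0.5, dj, cj)
                 for dj, cj in zip(d_frag, frag_card)]
ref_total, sel_path = collection_totals(frag_sel_path, frag_card)

# Derived fragmentation where 20% of C_i lies in eliminated fragments.
adjusted = part(0.5, 0.2)
```

In this example the two fragments get path selectivities of 0.3 and 0.15, giving 750 referenced objects out of 4000, and the aggregated selectivity from (13) agrees with 750 / 4000. Under the stated reading of (12), a factor of 0.5 becomes 0.625 when 20% of the collection has been eliminated.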
Path Expression Selectivity in Vertical Fragmentation. Let $C_i$, $1 \le i \le \ell$, be a vertically fragmented collection where only one $C_i$ vertical fragment contains the reference attribute used in the path expression navigation. The remaining $C_i$ fragments are accessed during the query evaluation only if their attributes are necessary to probe the predicate $p_i$. The selectivity factor of the path expression is the same in all $C_i$ fragments and the total number of distinct $C_i$ objects which are accessed during the path expression evaluation is obtained by:

$$REF_i = ref^*_i , \qquad (15)$$

where

$$ref^*_i = {SEL'}_i \times \|C_i\| . \qquad (16)$$

The term $ref^*_i$ denotes the number of distinct $C_i$ objects that are accessed in one $C_i$ vertical fragment. Note that ${SEL'}_i$ is obtained according to equations (2) and (3). However, each $C_i$ object corresponds to $f_i$ stored objects, according to $C_i$ vertical fragments. Therefore, we estimate the total number of $C_i$ objects which are accessed in the non-eliminated vertical fragments during the path expression evaluation as:

$$REF\_v_i = (f_i - \#E_i) \times ref^*_i . \qquad (17)$$

Finally, the nested predicate $p_i$ has several selectivity factors according to $C_i$ vertical fragments, thus its resultant selectivity factor is estimated as:

$$SEL_i = \min_{F_j^i \notin E_i} \left( SEL_j^i \right) . \qquad (18)$$
Both vertical and horizontal fragmentation estimates may be easily combined to calculate the selectivity factors of hybrid fragmentation techniques.
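As an illustration of equations (15) to (18), the following Python sketch computes the vertical fragmentation estimates for a single collection. The function names, the 0-based fragment indices, and the sample statistics are assumptions made for this example.

```python
from typing import List, Set, Tuple


def vertical_refs(sel_path_i: float, card_i: float,
                  num_frags: int, num_not_scanned: int) -> Tuple[float, float]:
    """Equations (15) to (17).

    Returns (ref_star, ref_v): ref_star is the number of distinct C_i objects
    accessed in one vertical fragment (which also equals REF_i), and ref_v is
    the number of stored objects accessed over all non-eliminated fragments.
    """
    ref_star = sel_path_i * card_i                    # equation (16)
    ref_v = (num_frags - num_not_scanned) * ref_star  # equation (17)
    return ref_star, ref_v


def vertical_predicate_selectivity(frag_sel: List[float],
                                   not_scanned: Set[int]) -> float:
    """Equation (18): minimum SEL_j^i over the fragments that are scanned."""
    return min(s for j, s in enumerate(frag_sel) if j not in not_scanned)


# Hypothetical collection split into three vertical fragments, one eliminated.
ref_star, ref_v = vertical_refs(sel_path_i=0.4, card_i=1000,
                                num_frags=3, num_not_scanned=1)
sel_i = vertical_predicate_selectivity([0.5, 0.0, 0.8], {1})
```

Here 400 distinct objects are referenced, which corresponds to 800 stored objects across the two scanned fragments, and the resulting predicate selectivity is 0.5.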
3 Experimental Analysis
In order to validate our cost model, we have compared its performance with results previously obtained [15] in practical experiments. These experimental results were obtained using the OO7 benchmark [4] on top of the GOA DBMS prototype [12].
Experimental and simulation results in terms of number of IO operations per query are shown in Figures 1 to 4. The results focus on the performance of the path expression evaluation in queries Q1-Q5 using strategy NP-F (forward naïve pointer chasing) and in queries Q1-Q2 using strategy VJ-R (reverse value-based join), disregarding the cost of displaying query results.
Figure 1 shows the number of IO operations that occurred in the execution of each path expression evaluation strategy in the centralized environment, and compares them to the predictions of our cost model, showing that the estimates are very close to the measured values in all evaluated scenarios. As expected, most of the predicted results are slightly higher than the experimental ones, since some cost model formulas calculate the worst case for disk random access. Queries Q3 and Q5, however, presented experimental results somewhat higher than those predicted by the cost model. This is because they are very fast queries, so the overhead of catalog access in the real experiment was more significant.

Fig. 1. NP-F and VJ-R execution IO cost (4 Mbytes memory). [Bar chart: # IO operations for Q1-F, Q2-F, Q3-F, Q4-F, Q5-F, Q1-R, and Q2-R; cost model results vs. experimental results.]

Fig. 2. IO cost per node of Q1-F execution in a distributed environment. [Chart: # IO operations vs. # nodes (2, 4, 8, 12); cost model results vs. experimental results.]

Fig. 3. IO cost varying the sharing degree (4 Mbytes memory). [Chart: # IO operations vs. sharing degree (1, 3, 10) for NP-F, NP-R, VJ-F, and VJ-R.]

Fig. 4. IO cost per node of Q4-F execution in a distributed environment. [Chart: # IO operations vs. # nodes (2, 4, 8, 12); cost model results vs. experimental results.]
Query Q4-F is defined over two large collections (AtomicParts and Connections). We assume that ||AtomicParts||=100000 and ||Connections||=300000. According to [1, 10], the number of accessed objects in the collection Connections is estimated as X2=189637. Since the participation of the collections in the path expression is total, we observe that the real value of accessed objects from Connections should be 300000. According to our proposed formulas (1) and (2), the corresponding parameter is REF2=300000. This example shows a difference, which is avoided in our estimation method, of approximately 37% between the result obtained by the traditional probabilistic estimation method and the real result.
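The comparison for Q4-F can be reproduced numerically. In the sketch below, the total participation case is expressed through formulas (1) and (2) with SEL_1 = 1 and D_{1,2} = ||Connections||. The traditional estimate is computed with Cardenas' formula n × (1 − (1 − 1/n)^k), taking k = 300000 references; that the models in [1, 10] use exactly this form is an assumption made here, chosen because it yields a value within one object of the reported 189637.

```python
# Collection cardinalities assumed in the text for query Q4-F.
card_atomic_parts = 100_000
card_connections = 300_000

# Formulas (1) and (2) under total participation. Assumptions of this sketch:
# no nested predicate on AtomicParts (SEL_1 = 1) and every Connections object
# is referenced (D_{1,2} = ||Connections||).
sel_1 = 1.0
d_1_2 = card_connections
sel_path_2 = 1.0 * sel_1 * d_1_2 / card_connections   # formula (2)
ref_2 = sel_path_2 * card_connections                 # formula (1)

# Traditional probabilistic estimate, assumed here to be Cardenas' formula
# with k references drawn uniformly over n target objects.
n = card_connections
k = card_connections
probabilistic = n * (1.0 - (1.0 - 1.0 / n) ** k)

# Relative deviation of the probabilistic estimate from the real value.
deviation = 1.0 - probabilistic / ref_2
```

Under these assumptions, `ref_2` equals 300000 while the probabilistic estimate falls just below 189637, a deviation of roughly 37% as stated above.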
In Figure 3, we analyzed the effect of varying the sharing degree (1, 3, and 10 in each collection) of the objects along a path expression of length $\ell = 3$. The n-ary operator (NP-F and NP-R) has the worst behavior as the sharing degree increases, since it ignores repeated object accesses and thus performs very poorly. The value-based join (VJ-F and VJ-R) presented a constant behavior because it avoids the negative effect of object sharing, and should be considered a good choice when object sharing is very high. This example shows that if the cost model does not consider the reverse direction or the value-based join algorithm, then the query execution strategy is limited to a very inefficient choice.
Queries Q1-F and Q4-F were executed in a distributed environment using 2, 4, 8, and 12 nodes. Figures 2 and 4 show the number of IO operations per node that occurred in the execution of each query. Our cost predictions are fairly close to values from experimental distributed execution, as in the centralized case.
4 Conclusions
Efficient processing of path expressions is fundamental for current query languages. The main contribution of this work is a new, realistic cost model to estimate the execution costs of evaluating path expressions in a distributed environment. The proposed cost model addresses binary and n-ary operators, as well as forward and reverse directions for path expression evaluation. It also considers issues such as the selectivity of the path expression, the sharing degree of the referenced objects (which contributes to the IO reload overhead estimate), the physical clustering of the objects on disk, and the partial participation of the class collections in path relationships. These issues were combined and extended to encompass distributed processing, covering both horizontal (primary and derived) and vertical fragmentation of data.
We have shown the significant deviation from real results in the traditional probabilistic method for estimation of the path expression selectivity when large collections with low selectivity factors are taken into account. Our selectivity estimation method avoids this deviation and has low computational complexity, consequently reducing processing costs in the optimization task. We also presented the limitations of always using the same algorithm and evaluation direction in path expression processing. The new cost model takes into account a large number of different factors, yet it remains fairly simple. The estimates generated by our cost model are very close to observed experimental results.
Currently we are working on extending this model for regular path expression processing. We are also experimenting with the cost model to examine different strategies and new algorithms for evaluating path expressions.
Acknowledgement
This work was partially financed by CNPq and FAPERJ. The author G. Ruberg was supported by Central Bank of Brazil.
References
1. Bellatreche, L., Karlapalem, K., Basak, G.: Query-Driven Horizontal Class Partitioning for Object-Oriented Databases. DEXA 1998, 692-701
2. Bellatreche, L., Karlapalem, K., Li, Q.: Derived Horizontal Class Partitioning in OODBs: Design Strategies, Analytical Model and Evaluation. ER 1998, 465-479
3. Bertino, E., Foscoli, P.: On Modeling Cost Functions for Object-Oriented Databases. IEEE TKDE 9(3), 500-508 (1997)
4. Carey, M., DeWitt, D., Naughton, J.: The OO7 Benchmark. ACM SIGMOD 22(2), 12-21 (1993)
5. Cho, W., Park, C., Whang, K., Son, S.: A New Method for Estimating the Number of Objects Satisfying an Object-Oriented Query Involving Partial Participation of Classes. Information Systems 21(3), 253-267 (1996)
6. Deutsch, A., Fernandez, M., et al.: Querying XML Data. IEEE Data Engineering Bulletin 22(3), 10-18 (1999)
7. Ezeife, C., Zheng, J.: Measuring the Performance of Database Object Horizontal Fragmentation Schemes. IDEAS 1999, 408-414
8. Fung, C., Karlapalem, K., Li, Q.: Cost-driven evaluation of vertical class partitioning in object oriented databases. DASFAA 1997, 11-20
9. Gardarin, G., Gruser, J., Tang, Z.: A Cost Model for Clustered Object-Oriented Databases. VLDB 1995, 323-334
10. Gardarin, G., Gruser, J., Tang, Z.: Cost-based Selection of Path Expression Processing Algorithms in Object-Oriented Databases. VLDB 1996, 390-401
11. Kossmann, D.: The State of the Art in Distributed Query Processing. ACM Computing Surveys 32(4), 422-469 (2000)
12. GOA++ Object Management System. URL: http://www.cos.ufrj.br/~goa
13. Ozkan, C., Dogac, A., Altinel, M.: A Cost Model for Path Expressions in Object Oriented Queries. Journal of Database Management 7(3), 25-33 (1996)
14. Ruberg, G.: A Cost Model for Query Processing in Distributed-Object Databases, M.Sc. Thesis in Portuguese, COPPE/UFRJ, Brazil (2001). Reduced version in English available in http://www.cos.ufrj.br/~gruberg/ruberg2001_english.pdf
15. Tavares, F.O., Victor, A.O., Mattoso, M.: Parallel Processing Evaluation of Path Expressions. SBBD 2000, 49-63