
R. Cicchetti et al. (Eds.): DEXA 2002, LNCS 2453, pp. 351–360, 2002. Springer-Verlag Berlin Heidelberg 2002

Estimating Costs of Path Expression Evaluation in Distributed Object Databases

Gabriela Ruberg, Fernanda Baião, and Marta Mattoso

Department of Computer Science – COPPE/UFRJ P.O.Box 68511, Rio de Janeiro, RJ, 21945-970 – Brazil {gruberg, baiao, marta}@cos.ufrj.br

Abstract. Efficient evaluation of path expressions in distributed object databases involves choosing among several query processing strategies, due to the rich semantics involved in object-based data models and to the complexity added by the distribution. This work presents a new cost model for object-based query processors and addresses relevant issues, which are ignored or relaxed in other works in the literature, such as the selectivity of the path expression, the sharing degree of the referenced objects, the partial participation of the collections in the relationships, and the distribution of the database objects across the nodes of a network. These issues allowed us to present more realistic estimates for the query optimizer. Our cost model has been validated against experimental results obtained with an object DBMS prototype running in a distributed architecture, using the OO7 benchmark application.

1 Introduction

The development of realistic and efficient query optimizers is extremely important in enhancing the performance of database systems. In current query languages, path expression processing optimization is a central and difficult issue. Reference attributes in path expressions provide direct (pointer) access through object navigation, such as in object databases, or element navigation in XML [6]. The choice of the best execution plan to process a query with reference attributes is not simple for the query optimizer to make, due to the large number of execution strategies and algorithms to evaluate a path expression. Relevant issues must be considered for this problem, including choosing from binary or n-ary operators, pointer- or value-based algorithms, forward or reverse evaluation directions. These issues are not fully addressed in current cost models for object-based query optimizers, compromising the accuracy of estimate models.

In a distributed environment, the query execution search space is even larger because of fragmented data. Distributed data processing is becoming popular due to performance gains obtained from PC clusters, grid computing, and the Web, among others [11]. However, current path expression optimizers lack practical cost functions for ad-hoc queries in fragmented collections of objects. These functions may not be directly obtained from centralized cost models because some fragmented data may be disregarded in advance during query execution, substantially modifying the query costs. Even in a centralized context, a realistic cost model cannot be obtained by simply combining relevant issues into a single model, because these issues are strongly related to each other and have to be remodeled. Table 1 identifies the presence of important issues in current object database cost models. Next, we discuss the impact of the issues of Table 1 on the cost estimates.

Estimating the selectivity factor is essential to the performance analysis of query processing. The selectivity of path expressions can vary significantly according to partial or total participation of a class in a relationship. Partial participation means that only a subset of the objects in a class is related to the objects of another class. However, most cost models [1, 2, 7, 8, 9, 10, 13] disregard partial participation. Cho et al. [5] present a realistic method for the estimation of selectivity factors, but only in centralized object databases.

A path expression can be evaluated in a forward direction (from the first to the last collection) or in a reverse direction (in the opposite way). Many cost models [1, 2, 7, 8, 9] are limited to the forward direction. The reverse direction is not obtained by simply reversing the index range; other parameters have to be added. The two basic algebra operators for path expression evaluation are the n-ary operator and the binary operator. The execution costs of a path expression may vary significantly for each pair (evaluation direction, algebra operator), according to the selectivity of the nested predicates and to the partial participation of the collections in the relationships of the path expression. A cost model restricted to a specific direction or execution strategy may prevent the query optimizer from choosing the best execution plan.

The number of IO operations, estimated in terms of data pages, is often presented as the basic cost in query processing [1, 2, 5, 8, 9, 10, 13], especially in centralized execution. The object data model allows complex strategies due to the rich variety of constructors provided, and cost estimates of IO operations can change drastically if object clustering techniques are applied. This aspect has been disregarded by most cost models [1, 2, 5, 7, 8, 10, 13]. Very few works analyze CPU costs [9, 10]. Communication costs of distributed evaluation of path expressions in vertically and/or horizontally fragmented classes are not addressed in the literature. Almost all processing cost factors are strongly influenced by the size of available main memory, although this factor is usually not taken into account. Under the small memory hypothesis, the IO reload overhead of a path expression evaluation is traditionally estimated using the collection fan-out parameter [8, 9, 10]. However, we have noticed that practically no additional IO operations are necessary if there is no object sharing in the relationships of the path expression, even if the fan-out is greater than one.

Works on distributed object-based cost functions are dedicated to algorithms for class partitioning in object databases. They focus on the analysis of primary horizontal (P.H.F.) [1, 2, 7], derived horizontal (D.H.F.) [2], and vertical (V.F.) [8] fragmentation methodologies. Their application in a real query optimizer is somewhat restricted, since they disregard important issues in path expression evaluation, such as the object clustering policy, the evaluation direction, the binary operator and its algorithms, and CPU costs. These issues are not trivially included in the cost model of the algorithms.


Table 1. Important issues in path expression processing and related cost models

Issues / Cost Models: [1] [2] [3] [5] [7] [8] [9] [10] [13] [14]

Centralized
Partial participation: X X X
Physical object clustering: X X
Evaluation direction: X X X X X
N-ary operator: X X X X X X X X
Binary operator: X X X X
IO overhead due to obj. sharing: X
IO costs: X X X X X X X X
CPU costs: X X X

Distributed
P.H.F.: X X X X
D.H.F.: X X
V.F.: X X
Communication costs: X

We present a new cost model that covers the most representative algorithms for binary and n-ary operators, as well as forward and reverse directions for general path expression evaluation, in both centralized and distributed environments. An extended version of this work with detailed cost formulas may be found in [14]. In addition, our cost model has been validated against experimental results obtained in our previous work [15].

The remainder of this paper is organized as follows. Section 2 describes our cost model with emphasis on the estimation of selectivity factors. Section 3 shows the validation of our cost model against experimental results, obtained with an object DBMS prototype using the OO7 benchmark. Finally, Section 4 presents conclusions and future work.

2 Cost Model

Invariably, the complexity of optimization problems requires some simplifications in the cost model. We assume that: (i) the query optimizer is able to break the encapsulation property; (ii) objects have a size less than a database page; (iii) the attribute values are uniformly distributed among instances of a class; and (iv) each object collection has just one class as its domain. These assumptions are present in other cost models [1, 2, 3, 5, 8, 9, 10, 13] since they occur in most object-based DBMS, as well as in their typical applications. Thus, they do not limit the expressive power of our cost model.

In our approach for estimating the cost of query execution plans, we consider that queries are issued against collections, thus some statistics are maintained for collections rather than for classes. The parameters of the fragments $F_j^i$ are represented similarly to the parameters of the collections, adding the index $j$, $1 \le j \le f_i$. Therefore, ${SEL'}_j^i$ represents the selectivity of the path expression over the fragment $F_j^i$ while $D_{i,i+1}^{j}$ is the total number of distinct pointers from $C_i$ objects to $F_j^{i+1}$ objects.

Table 2. Cost Model Parameters

Param: Description
$SEL_i$: Selectivity of nested predicate $p_i$ over $C_i$
$SEL'_i$: Selectivity of the path expression over $C_i$
$\|C_i\|$: Cardinality of $C_i$
$|C_i|$: # pages of $C_i$
$CS_i$: Average size of one object of $C_i$
$f_i$: # fragments of $C_i$
$Z_{i-1,i}$: Average # distinct pointers to $C_{i+1}$ objects from $C_i$ objects that have at least one non null reference
$D_{i,i+1}$: Total # distinct pointers from $C_i$ objects to $C_{i+1}$ objects
$X_{i,i+1}$: # $C_i$ objects having all pointers to $C_{i+1}$ objects as null references
$sel_j^i$: Selectivity factor over the $C_i$ cardinality according to the $F_j^i$ cardinality
$REF_i$: # distinct accessed objects from $C_i$ in the path evaluation
$ref_j^i$: Analogous to $REF_i$, in $F_j^i$
$\ell$: Length of the path expression

2.1 Selectivity Factor of Path Expressions

The basis for evaluating query optimization strategies is the estimation of the selectivity factor of selection predicates and joins [5]. The selectivity factor of a path expression is the selectivity factor resultant from the nested predicates and the participation of each class collection in path relationships. Partial participation of a class collection influences the prediction of the path expression selectivity due not only to the estimation of the selectivity of implicit joins, but also due to the estimation of selectivity of nested predicates. Therefore, only the referenced objects in the path expression must be taken into account to estimate the selectivity factors over the collections.

When pointer-based algorithms are used, the path expression selectivity over a collection $C_i$ represents the portion of the $C_i$ objects that will be accessed during the path evaluation. Moreover, the path expression selectivity determines the cardinality of the intermediate results generated by join algorithms (pointer and value-based). Its computation over each collection $C_i$, $1 \le i \le \ell$, also depends on the direction used to evaluate the path expression. Thus, given a path expression, we may express the number of distinct accessed objects in collection $C_i$ during the path navigation as:

$REF_i = SEL'_i \times \|C_i\|$. (1)

The term $SEL'_i$, $1 \le i \le \ell$, is obtained according to the evaluation direction:

• In forward, $SEL'_1 = 1$ and $SEL'_i = SEL'_{i-1} \times SEL_{i-1} \times \frac{D_{i-1,i}}{\|C_i\|}$; (2)

• In reverse, $SEL'_\ell = 1$ and $SEL'_i = SEL'_{i+1} \times SEL_{i+1} \times \frac{\|C_i\| - X_{i,i+1}}{\|C_i\|}$. (3)

Note that all objects in the starting collection ($C_1$ or $C_\ell$, according to the evaluation direction) are accessed because there is no filter from a previous relationship in the path expression.
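The recurrences in equations (1) to (3) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the 0-based list layout are assumptions made here, not part of the cost model, and the statistics in the usage lines are the OO7 figures quoted in Section 3 for query Q4-F, with every Connections object assumed to be referenced.

```python
from typing import List


def forward_selectivities(card: List[float], sel: List[float], d: List[float]) -> List[float]:
    """SEL'_i in the forward direction, following eq. (2).

    Lists are 0-based: card[k] is ||C_{k+1}||, sel[k] is SEL_{k+1},
    and d[k] is D_{k+1,k+2} (one entry per relationship).
    """
    sel_path = [1.0]  # SEL'_1 = 1: the starting collection is fully accessed
    for k in range(1, len(card)):
        sel_path.append(sel_path[k - 1] * sel[k - 1] * d[k - 1] / card[k])
    return sel_path


def reverse_selectivities(card: List[float], sel: List[float], x: List[float]) -> List[float]:
    """SEL'_i in the reverse direction, following eq. (3).

    x[k] is X_{k+1,k+2}: the number of C_{k+1} objects whose pointers
    to C_{k+2} are all null.
    """
    n = len(card)
    sel_path = [0.0] * n
    sel_path[n - 1] = 1.0  # the last collection is the starting one
    for k in range(n - 2, -1, -1):
        sel_path[k] = sel_path[k + 1] * sel[k + 1] * (card[k] - x[k]) / card[k]
    return sel_path


def referenced_objects(card: List[float], sel_path: List[float]) -> List[float]:
    """REF_i = SEL'_i * ||C_i||, following eq. (1)."""
    return [s * c for s, c in zip(sel_path, card)]


# Usage: a two-collection path with total participation and no nested predicates.
card = [100_000.0, 300_000.0]
ref_forward = referenced_objects(card, forward_selectivities(card, [1.0, 1.0], [300_000.0]))
print(ref_forward)
```

Keeping the two directions as separate functions mirrors the point made earlier in the text: the reverse recurrence needs the extra parameter $X_{i,i+1}$ and is not obtained by merely reversing the loop of the forward one.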

In path expressions involving large collections with low selectivity factors, the traditional probabilistic method for selectivity estimation [1, 2, 3, 5, 8, 10] results in a significant deviation from real values, as shown in Section 3. This difference, which is avoided in our method, may be propagated to the estimation of page hits and to all costs that are based on the selectivity factor (IO, CPU and communication costs). Additionally, our method has low computational complexity, thus reducing processing costs in the optimization task.

Fragmentation Effects. Horizontal fragmentation distributes class instances among fragments (object collections) with the same structure, according to a given fragmentation criterion. Analogously, vertical fragmentation splits the logical structure of a class and distributes its attributes (and methods) among fragments with the same cardinality. Let $C_i$, $1 \le i \le \ell$, be a collection of a path expression with primary horizontal or vertical fragmentation. During the evaluation of this path expression, the query processor can identify in advance:

i) a horizontal fragment $F_j^i$, $1 \le j \le f_i$, where the selectivity of the associated nested predicate $p_i$ is zero; or
ii) a vertical fragment $F_j^i$, $1 \le j \le f_i$, whose attributes are not used in the query.

In both cases, we assume $SEL_j^i = 0$, thus causing the elimination of $F_j^i$ during the query processing. If $C_i$ is fragmented, only the set of fragments of $C_i$ in which $SEL_j^i \ne 0$ will be scanned during the query evaluation process. We may define the subset $Elim_i$ containing all $C_i$ fragments eliminated by $SEL_j^i$ as:

$Elim_i = \{ F_j^i,\ 1 \le j \le f_i \mid SEL_j^i = 0 \}$. (4)

In addition, we may define the subset $Elim'_i$ that refers to the derived horizontal fragments from $C_i$ which were indirectly eliminated by the path expression selectivity (if their primary fragments were eliminated too), as follows:

$Elim'_i = \{ F_j^i,\ 1 \le j \le f_i \mid (F_j^{i-1} \rightarrow F_j^i) \wedge (F_j^{i-1} \in E_{i-1}) \}$. (5)

The term $F_j^{i-1} \rightarrow F_j^i$ denotes that the primary fragment $F_j^{i-1}$ determines the derived fragment $F_j^i$, in the forward evaluation. The reverse evaluation formula is obtained analogously to (5). We may define the set $E_i$, $1 \le i \le \ell$, with cardinality $\#E_i$, of all $C_i$ fragments that will not be scanned during the path expression evaluation as:

$E_i = Elim_i \cup Elim'_i$. (6)

We estimate the selectivity factor of $C_i$ objects that belong to $E_i$ as:

$selE_i = \sum_{F_j^i \in E_i} sel_j^i$. (7)

The formal definition of set Ei and of its subsets, representing the fragmented data that is disregarded during the query evaluation, allows us to properly estimate the selectivity factors and execution costs of a distributed path expression evaluation.
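The fragment elimination sets of equations (4) to (7) can be illustrated with a short Python sketch. The `Fragment` record, its field names, and the sample fragments are hypothetical and introduced only for this example; the sketch covers the forward direction, where a derived fragment of $C_i$ is eliminated when its primary fragment in $C_{i-1}$ is.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Fragment:
    name: str
    sel_pred: float                # SEL_j^i: nested predicate selectivity on the fragment
    share: float                   # sel_j^i: fraction of ||C_i|| held by the fragment
    primary: Optional[str] = None  # primary fragment of C_{i-1} that determines it, if derived


def eliminated(fragments: List[Fragment], prev_eliminated: Set[str]) -> Set[str]:
    """E_i = Elim_i U Elim'_i for one collection, forward direction."""
    elim = {f.name for f in fragments if f.sel_pred == 0}                        # eq. (4)
    elim_derived = {f.name for f in fragments if f.primary in prev_eliminated}   # eq. (5)
    return elim | elim_derived                                                   # eq. (6)


def eliminated_share(fragments: List[Fragment], e: Set[str]) -> float:
    """selE_i: share of C_i objects stored in eliminated fragments, eq. (7)."""
    return sum(f.share for f in fragments if f.name in e)


# Usage: C_1 has primary fragmentation, C_2 is derived from it.
c1 = [Fragment("F1_1", 0.0, 0.3), Fragment("F1_2", 0.5, 0.7)]
c2 = [Fragment("F2_1", 1.0, 0.4, primary="F1_1"),
      Fragment("F2_2", 1.0, 0.6, primary="F1_2")]
e1 = eliminated(c1, set())
e2 = eliminated(c2, e1)
print(e1, e2, eliminated_share(c2, e2))
```

Computing $E_i$ collection by collection along the path lets each step reuse $E_{i-1}$, which is exactly the dependency expressed by equation (5).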

Path Expression Selectivity in Horizontal Fragmentation. The number of distinct objects retrieved from a horizontally fragmented collection $C_i$, $1 \le i \le \ell$, during the evaluation of a path expression is given by:

$REF_i = \sum_{j=1}^{f_i} ref_j^i$, (8)

where ${ref}_j^i = {SEL'}_j^i \times \|F_j^i\|$, $1 \le j \le f_i$. (9)

If $F_j^i \in E_i$ then we have $ref_j^i = 0$. Otherwise, $F_j^i \notin E_i$ and ${SEL'}_j^i$ is calculated according to both the horizontal fragmentation strategy of $C_i$ (primary or derived) and the path expression evaluation direction. In a forward evaluation¹, for $1 < i \le \ell$ and $1 \le j \le f_i$, we have:

• In P.H.F., ${SEL'}_j^1 = 1$ and ${SEL'}_j^i = SEL'_{i-1} \times SEL_{i-1} \times \frac{D_{i-1,i}^{j}}{\|F_j^i\|}$; (10)

• In D.H.F., ${SEL'}_j^1 = 1$ and ${SEL'}_j^i = part_j^i(SEL'_{i-1} \times SEL_{i-1}) \times \frac{D_{i-1,i}^{j}}{\|F_j^i\|}$. (11)

In equation (11), the function $part_j^i(factor)$ returns the participation of the fragment $F_j^i$ in the objects selected by $factor$ from $C_i$. Modeling this participation is important because if derived horizontal fragmentation is applied on $C_i$ and some of its fragments are eliminated by their $C_{i-1}$ primary fragments, then only non-eliminated $C_i$ fragments contribute to $REF_i$ objects. Indeed, the selectivity term ($SEL'_{i-1} \times SEL_{i-1}$) is not proportionally distributed among all $C_i$ fragments, but restricted to non-eliminated $C_i$ fragments. Therefore:

$part_j^i(factor) = \begin{cases} 1, & \text{if } factor = 1, \\ factor + \dfrac{factor \times selElim'_i}{1 - selElim'_i}, & \text{otherwise.} \end{cases}$ (12)

The selectivity factor $selElim'_i$ of objects from $C_i$ fragments that were eliminated by the path expression selectivity is analogous to formula (7). Finally, the path expression selectivity and the nested predicate selectivity over a collection $C_i$ that is horizontally fragmented are given respectively by:

• $SEL'_S = 1$ and $SEL'_i = \sum_{j=1}^{f_i} \left( sel_j^i \times {SEL'}_j^i \right)$; (13)

• $SEL_i = \sum_{j=1}^{f_i} \left( sel_j^i \times SEL_j^i \right)$. (14)

The term $SEL'_S$ represents the selectivity factor of the path expression in the starting collection. Note that partial participation of collections in path relationships influences the estimation of the selectivity factors of each fragment involved in the path expression evaluation. In a distributed context, if total participation is assumed, then the difference from real values to estimates is even larger due to accumulation of many fragment deviations.

¹ Estimation in reverse evaluation is obtained analogously to (3), applying the function $part(factor)$ if $C_i$ has derived fragmentation.
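For primary horizontal fragmentation in the forward direction, equations (8) to (10) reduce to a per-fragment loop, sketched below in Python. The function name, the argument layout, and the sample statistics are assumptions made for illustration; derived fragmentation and the $part$ function are deliberately left out of this sketch.

```python
from typing import List, Set


def ref_per_fragment_phf(prev_sel_path: float,
                         prev_sel_pred: float,
                         d_to_fragment: List[float],
                         frag_card: List[float],
                         eliminated: Set[int]) -> List[float]:
    """ref_j^i for each primary horizontal fragment F_j^i of C_i (forward direction).

    prev_sel_path is SEL'_{i-1}, prev_sel_pred is SEL_{i-1},
    d_to_fragment[j] is the number of distinct pointers from C_{i-1} into F_j^i,
    frag_card[j] is ||F_j^i||, and eliminated holds the indexes j of fragments in E_i.
    """
    refs = []
    for j, (d, card) in enumerate(zip(d_to_fragment, frag_card)):
        if j in eliminated:
            refs.append(0.0)  # fragment in E_i: it is never scanned
            continue
        sel_path_j = prev_sel_path * prev_sel_pred * d / card  # eq. (10)
        refs.append(sel_path_j * card)                         # eq. (9)
    return refs


# Usage: three fragments of C_i, the third one eliminated by its predicate.
refs = ref_per_fragment_phf(1.0, 0.5, [600.0, 400.0, 200.0], [1000.0, 500.0, 500.0], {2})
ref_total = sum(refs)  # REF_i, eq. (8)
print(refs, ref_total)
```

Since each node holding a fragment can evaluate its own $ref_j^i$ from local statistics, this per-fragment form fits the distributed setting the section targets.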

Path Expression Selectivity in Vertical Fragmentation. Let $C_i$, $1 \le i \le \ell$, be a vertically fragmented collection where only one $C_i$ vertical fragment contains the reference attribute used in the path expression navigation. The remaining $C_i$ fragments are accessed during the query evaluation only if their attributes are necessary to probe the predicate $p_i$. The selectivity factor of the path expression is the same in all $C_i$ fragments and the total number of distinct $C_i$ objects which are accessed during the path expression evaluation is obtained by:

$REF_i = ref^*_i$, (15)

where $ref^*_i = SEL'_i \times \|C_i\|$. (16)

The term $ref^*_i$ denotes the number of distinct $C_i$ objects that are accessed in one $C_i$ vertical fragment. Note that $SEL'_i$ is obtained according to equations (2) and (3). However, each $C_i$ object corresponds to $f_i$ stored objects, according to $C_i$ vertical fragments. Therefore, we estimate the total number of $C_i$ objects which are accessed in the non-eliminated vertical fragments during the path expression evaluation as:

$REF\_v_i = (f_i - \#E_i) \times ref^*_i$. (17)

Finally, the nested predicate $p_i$ has several selectivity factors according to $C_i$ vertical fragments, thus its resultant selectivity factor is estimated as:

$SEL_i = \min_{F_j^i \notin E_i} \left( SEL_j^i \right)$. (18)

Both vertical and horizontal fragmentation estimates may be easily combined to calculate the selectivity factors of hybrid fragmentation techniques.
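The vertical fragmentation estimates of equations (15) to (18) can be sketched as a single Python function. The function name, its parameters, and the numbers in the usage lines are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from typing import List, Tuple


def vertical_estimates(sel_path: float,
                       card: float,
                       n_fragments: int,
                       n_eliminated: int,
                       kept_pred_sel: List[float]) -> Tuple[float, float, float]:
    """Estimates for a vertically fragmented collection C_i.

    sel_path is SEL'_i, card is ||C_i||, n_fragments is f_i, n_eliminated is #E_i,
    and kept_pred_sel lists SEL_j^i for the non-eliminated fragments only.
    """
    ref_star = sel_path * card                        # eq. (16); equals REF_i by eq. (15)
    ref_v = (n_fragments - n_eliminated) * ref_star   # eq. (17): stored objects accessed
    sel_pred = min(kept_pred_sel)                     # eq. (18)
    return ref_star, ref_v, sel_pred


# Usage: three vertical fragments, one of them unused by the query.
ref_star, ref_v, sel_pred = vertical_estimates(0.25, 4000.0, 3, 1, [0.5, 0.2])
print(ref_star, ref_v, sel_pred)
```

The difference between `ref_star` and `ref_v` reflects the remark above that one logical $C_i$ object corresponds to one stored object per vertical fragment.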


3 Experimental Analysis

In order to validate our cost model, we have compared its performance with results previously obtained [15] in practical experiments. These experimental results were obtained using the OO7 benchmark [4] on top of the GOA DBMS prototype [12].

Experimental and simulation results in terms of number of IO operations per query are shown in Figures 1 to 4. The results focus on the performance of the path expression evaluation in queries Q1-Q5 using strategy NP-F (forward naïve pointer chasing) and in queries Q1-Q2 using strategy VJ-R (reverse value-based join), disregarding the cost of displaying query results.

Figure 1 shows the number of IO operations that occurred in the execution of each path expression evaluation strategy in the centralized environment, and compares them to the predictions of our cost model, showing that the estimates are very close to the experimental values in all the evaluated scenarios. As expected, most of the predicted results are slightly higher than the experimental ones, since some cost model formulas calculate the worst case for disk random access. Queries Q3 and Q5, however, presented experimental results somewhat higher than those predicted by the cost model. This is because they are very fast queries, so the overhead of catalog access in the real experiment was more significant.

Fig. 1. NP-F and VJ-R execution IO cost (4 Mbytes memory). [Plot: # IO operations for Q1-F, Q2-F, Q3-F, Q4-F, Q5-F, Q1-R, Q2-R; series: cost model results, experimental results.]

Fig. 2. IO cost per node of Q1-F execution in a distributed environment. [Plot: # IO operations versus # nodes (2, 4, 8, 12).]

Fig. 3. IO cost varying the sharing degree (4 Mbytes memory). [Plot: # IO operations versus sharing degree (1, 3, 10); series: NP-F, NP-R, VJ-F, VJ-R.]

Fig. 4. IO cost per node of Q4-F execution in a distributed environment. [Plot: # IO operations versus # nodes (2, 4, 8, 12).]

Query Q4-F is defined over two large collections (AtomicParts and Connections). We assume that ||AtomicParts||=100000 and ||Connections||=300000. According to [1, 10], the number of accessed objects in the collection Connections is estimated as X2=189637. Since the participation of the collections in the path expression is total, we observe that the real value of accessed objects from Connections should be 300000. According to our proposed formulas (1) and (2), the corresponding parameter is REF2=300000. This example shows a difference, which is avoided in our estimation method, of approximately 37% between the result obtained by the traditional probabilistic estimation method and the real result.
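The arithmetic behind this comparison can be reproduced with a small Python sketch. The closed form used for the traditional probabilistic estimate is an assumption: the text does not spell out the formula of [1, 10], so the sketch uses the common expected-distinct-targets expression for uniformly distributed pointers, which yields a value within one object of the reported X2. It also assumes 300000 pointers in total and $D_{1,2} = 300000$, i.e. every Connections object is referenced.

```python
def probabilistic_estimate(n_targets: int, n_pointers: int) -> float:
    """Expected number of distinct targets hit by uniformly random pointers.

    Assumed form of the traditional estimate: n * (1 - (1 - 1/n)^k).
    """
    return n_targets * (1.0 - (1.0 - 1.0 / n_targets) ** n_pointers)


def proposed_estimate(card_target: float, prev_sel_path: float,
                      prev_sel_pred: float, d: float) -> float:
    """REF_2 from eqs. (1) and (2), using the distinct pointer count D_{1,2}."""
    sel_path = prev_sel_path * prev_sel_pred * d / card_target  # eq. (2)
    return sel_path * card_target                               # eq. (1)


x2 = probabilistic_estimate(300_000, 300_000)
ref2 = proposed_estimate(300_000.0, 1.0, 1.0, 300_000.0)
deviation = (ref2 - x2) / ref2  # relative gap between the two estimates

print(f"probabilistic: {x2:.1f}, proposed: {ref2:.0f}, deviation: {deviation:.1%}")
```

Under these assumptions the probabilistic form saturates near $1 - e^{-1}$ of the target collection when the number of pointers equals its cardinality, which is where the gap of roughly 37% comes from.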

In Figure 3, we analyze the effect of varying the sharing degree (1, 3, and 10 in each collection) of the objects along a path expression with $\ell = 3$. The n-ary operator (NP-F and NP-R) has the worst behavior as the sharing degree increases, since it ignores repeated object accesses and thus performs very poorly. The value-based join (VJ-F and VJ-R) presented a constant behavior because it avoids the negative effect of object sharing, and should be considered a good choice when object sharing is very high. This example shows that if the cost model does not consider the reverse direction or the value-based join algorithm, then the query execution strategy is limited to a very inefficient choice.

Queries Q1-F and Q4-F were executed in a distributed environment using 2, 4, 8, and 12 nodes. Figures 2 and 4 show the number of IO operations per node that occurred in the execution of each query. Our cost predictions are fairly close to values from experimental distributed execution, as in the centralized case.

4 Conclusions

Efficient processing of path expressions is fundamental for current query languages. The main contribution of this work is a new, realistic cost model to estimate the execution costs of evaluating path expressions in a distributed environment. The proposed cost model addresses binary and n-ary operators, as well as forward and reverse directions for path expression evaluation. It also considers issues such as the selectivity of the path expression, the sharing degree of the referenced objects which contributes to IO reload overhead estimate, physical clustering of the objects in disk, and the partial participation of the class collections in path relationships. These issues were combined and extended to encompass distributed processing, covering both horizontal (primary and derived) and vertical fragmentation of data.

We have shown the significant deviation from real results in the traditional probabilistic method for estimation of the path expression selectivity when large collections with low selectivity factors are taken into account. Our selectivity estimation method avoids this deviation and has low computational complexity, consequently reducing processing costs in the optimization task. We also presented the limitations of always using the same algorithm and evaluation direction in path expression processing. The new cost model takes into account a large number of different factors, yet it remains fairly simple. The estimates generated by our cost model are very close to observed experimental results.

Currently we are working on extending this model for regular path expression processing. We are also experimenting with the cost model to examine different strategies and new algorithms for evaluating path expressions.

Acknowledgement

This work was partially financed by CNPq and FAPERJ. The author G. Ruberg was supported by Central Bank of Brazil.

References

1. Bellatreche, L., Karlapalem, K., Basak, G.: Query-Driven Horizontal Class Partitioning for Object-Oriented Databases. DEXA 1998, 692-701

2. Bellatreche, L., Karlapalem, K., Li, Q.: Derived Horizontal Class Partitioning in OODBs: Design Strategies, Analytical Model and Evaluation. ER 1998, 465-479

3. Bertino, E., Foscoli, P.: On Modeling Cost Functions for Object-Oriented Databases. IEEE TKDE 9(3), 500-508 (1997)

4. Carey, M., DeWitt D., Naughton, J.: The OO7 Benchmark. ACM SIGMOD 22(2), 12-21 (1993)

5. Cho, W., Park, C., Whang, K., Son, S.: A New Method for Estimating the Number of Objects Satisfying an Object-Oriented Query Involving Partial Participation of Classes. Information Systems 21(3), 253-267 (1996)

6. Deutsch, A., Fernandez, M., et al.: Querying XML Data. IEEE Data Engineering Bulletin 22(3), 10-18 (1999)

7. Ezeife, C., Zheng, J.: Measuring the Performance of Database Object Horizontal Fragmentation Schemes. IDEAS 1999, 408-414

8. Fung, C., Karlapalem, K., Li, Q.: Cost-driven evaluation of vertical class partitioning in object oriented databases. DASFAA 1997, 11-20

9. Gardarin, G., Gruser, J., Tang, Z.: A Cost Model for Clustered Object-Oriented Databases. VLDB 1995, 323-334

10. Gardarin, G., Gruser, J., Tang, Z.: Cost-based Selection of Path Expression Processing Algorithms in Object-Oriented Databases. VLDB 1996, 390-401

11. Kossmann, D.: The State of the Art in Distributed Query Processing. ACM Computing Surveys 32(4), 422-469 (2000)

12. GOA++ Object Management System. URL: http://www.cos.ufrj.br/~goa

13. Ozkan, C., Dogac, A., Altinel, M.: A Cost Model for Path Expressions in Object Oriented Queries. Journal of Database Management 7(3), 25-33 (1996)

14. Ruberg, G.: A Cost Model for Query Processing in Distributed-Object Databases, M.Sc. Thesis in Portuguese, COPPE/UFRJ, Brazil (2001). Reduced version in English available in http://www.cos.ufrj.br/~gruberg/ruberg2001_english.pdf

15. Tavares, F.O., Victor, A.O., Mattoso, M.: Parallel Processing Evaluation of Path Expressions. SBBD 2000, 49-63
