

Efficient Evaluation of Queries in a Mediator for WebSources*

Vladimir Zadorozhny University of Pittsburgh

Pittsburgh, PA 15260 [email protected]

Louiqa Raschid University of Maryland
College Park, MD 20742 [email protected]

Maria Esther Vidal Simon Bolivar University
Caracas, Venezuela [email protected]

Laura Bright University of Maryland

College Park, MD 20742 [email protected]

Tolga Urhan BEA Systems, Inc. San Jose, CA 95131 [email protected]

ABSTRACT We consider an architecture of mediators and wrappers for Internet accessible WebSources of limited query capability. Each call to a source is a WebSource Implementation (WSI) and it is associated with both a capability and a (possibly dynamic) cost. The multiplicity of WSIs with varying costs and capabilities increases the complexity of a traditional optimizer that must assign WSIs for each remote relation in the query while generating an (optimal) plan. We present a two-phase Web Query Optimizer (WQO). In a pre-optimization phase, the WQO selects one or more WSIs for a pre-plan; a pre-plan represents a space of query evaluation plans (plans) based on this choice of WSIs. The WQO uses cost-based heuristics to evaluate the choice of WSI assignment in the pre-plan and to choose a good pre-plan. The WQO uses the pre-plan to drive the extended relational optimizer to obtain the best plan for a pre-plan. A prototype of the WQO has been developed. We evaluate the effectiveness of the WQO, i.e., its ability to efficiently search a large space of plans and obtain a low cost plan, in comparison with a traditional optimizer. We also validate the cost-based heuristics by experimental evaluation of queries in the noisy Internet environment.

1. INTRODUCTION

The rapid growth of the Internet and Intranets and the emergence of XML for data interchange have increased the opportunity for wide area applications against WebSources that are accessible over a wide area network via scripts or form-based interfaces. An example application domain is the hundreds of biomolecular data sources. Architectures that have been developed for heterogeneous DBMS have to be tailored to this new environment. There are several characteristics of WebSources that must be considered. First, unlike a relational DBMS that will accept a wide range of queries, e.g., any query that can be expressed in the relational algebra, a WebSource supports limited query capability. Typically, most forms-based interfaces impose a restriction that one or more input attributes must have a binding, and also restrict the output attributes that are projected. Second, the sources are autonomous and are accessed over wide area networks that are dynamic. These sources typically do not provide all of the cost metrics that are used in cost-based query optimization. In addition, the dynamic nature of the wide area network can introduce delays that significantly vary access costs. A specific wrapper call to a remote WebSource is labeled a WebSource Implementation (WSI) and corresponds to a limited query capability and its (possibly varying) cost.

*This research has been partially supported by the Defense Advanced Research Projects Agency under grant 01-5-28838; the National Science Foundation under grant IRI9630102 and by CONICIT, Venezuela.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ACM SIGMOD 2002 June 4-6, Madison, Wisconsin, USA Copyright 2002 ACM 1-58113-497-5/02/06...$5.00.

The challenges of diverse WSIs take on an even greater significance in a domain such as the biomolecular data sources. First, the domain is characterized by a multiplicity of WebSources with diverse and possibly complex capability. There is significant overlap in the contents of the sources as well as relationships (links) among their contents. Second, the process of knowledge discovery involves complex queries on multiple data sources, where each query could execute for hundreds of seconds or longer [9, 17]. This multiplicity of diverse WSIs with varying costs makes the task of query optimization very difficult [10].

In this paper, we consider an architecture of wrappers and mediators, and multiple WebSources and WSIs. Our first contribution is a two-phase optimization approach that has been implemented in a Web Query Optimizer (WQO). The multiplicity of diverse WSIs with varying costs and capabilities can significantly increase the search space of an optimizer that assigns WSIs for each remote relation of the query. A traditional single-phase optimizer must exhaustively consider all possible combinations of WSIs, while simultaneously generating a low cost (optimal) query evaluation plan (plan). Further, dynamic costs will have a negative impact on the robustness of a traditional cost model that expects accurate cost estimation. The key idea of two-phase optimization is to separate the choice of WSI assignments, for each remote relation in the query, from the cost-based optimization step, and to select WSIs prior to optimization. In the first phase, a pre-optimizer is responsible for creating a pre-plan and choosing appropriate WSI assignments for the pre-plan. A pre-plan is an abstraction that represents a space of multiple query evaluation plans (plans) for the query. The pre-optimizer evaluates the WSI assignments of each pre-plan to identify a good pre-plan. In the second phase, the WQO uses an extended relational optimizer to explore the space of plans, and generate the best plan for the particular choice of WSI assignments in the selected pre-plan(s).

A good WSI assignment in the pre-plan is one that will lead to a good low cost plan. In a domain with multiple WSIs, possibly complex query capabilities, and varying metrics that affect the cost of a plan, and the accuracy of the cost model, the choice of WSI assignment is not obvious. Under these circumstances, the benefit of the first phase of pre-optimization is the opportunity to evaluate good WSI assignments, prior to, and independent of, the costly task of generating the best plan. Thus, our second contribution is a set of cost-based heuristics to evaluate the choice of WSI assignments in the pre-plan. Our cost-based heuristics consider the following issues in evaluating WSIs in the pre-plan: (1) The pre-optimizer will consider alternate evaluation strategies such as top-down versus bottom-up evaluation of the mediator queries. The choice of top-down versus bottom-up evaluation is typically determined by the limited query capability of the WSIs. (2) The pre-optimizer will explore evaluation strategies based upon the choice of atomic versus composed WSIs. A composed WSI will typically submit multiple wrapper calls (queries) to a remote WebSource compared to an atomic WSI. (3) The pre-optimizer will explore access cost metrics characterizing multiple WSIs with similar query capabilities that impact the cost of the plan. Metrics that will be considered include latency, result cardinality and query selectivity of input bindings. While we do not claim that these heuristics provide complete information on the cost of a plan, they may be more suitable for the dynamic wide area environment characterized by varying costs, compared to a traditional and complex cost-based optimizer which relies on accurate cost estimates.

In the second phase, the WQO uses an extended (randomized) relational optimizer, based on traditional (cost-based) optimization strategies, to generate a plan. The relational optimizer uses the knowledge of the WSIs captured in the pre-plan, and it respects both the pre-plan and the limited query capabilities of the WSIs.

A prototype of the WQO for a mediator wrapper architecture of WebSources was constructed, extending the Predator Object-Relational DBMS [31]. The effectiveness of an optimizer is measured by its ability to generate low cost plans, while efficiently navigating the search space of plans. Thus, it considers both optimization time and the cost of the plan. We compare the effectiveness of the WQO to a traditional optimizer. We also validate that the cost-based heuristics are indeed capable of selecting good WSI assignments that lead to good low cost plans. We do so by performing an experimental evaluation of several queries on WebSources. The experimental evaluation indicates that the dynamic nature of the wide area environment affects the WebSource costs (statistics and access costs), and makes it difficult to characterize the execution of some plans as always good or always poor. This would confuse a cost based optimizer. However, we could validate that our heuristics can differentiate when the choice of WSIs leads to typically good or bad plans. Thus, our third contribution is the implementation and experimental validation of the mediator and its heuristics.

There are several areas of research that are relevant to our work. There has been much work in capability based rewriting of mediator queries with limited query capability of sources [11, 12, 16, 23, 24, 25, 30, 39, 40, 41, 43]. While we use the results of this research, this paper does not directly make a contribution in this area. There has also been some research on estimating the costs for accessing heterogeneous sources [1, 8, 27, 28], and on considering both capability and costs in query optimization. The most extensive research, in the Garlic project [27], has shown that both factors can impact the choice of a good plan. We will compare our approach of using cost-based heuristics with this work and identify our contributions. The WSQ/DSQ project [14] also considers the costs of combining database query processing with Web queries.

The paper is organized as follows: Section 2 provides an example mediator schema and WebSource capabilities. It describes the task of the Capability Based Rewriting (CBR) in producing a pre-plan. Section 3 then presents the WQO optimizer. We first describe the two-phase approach to optimization. We briefly discuss the cost-based heuristics used by the pre-optimizer to evaluate the WSI assignments. We then discuss the modifications to a relational optimizer. Section 4 describes the experimental evaluation of the effectiveness of the WQO, i.e., its ability to efficiently generate low cost plans. We verify that the cost-based heuristics can be effectively used by the WQO to choose good WSI assignments and produce a good plan in a noisy environment. Section 5 compares our approach with related work and concludes.

2. QUERY CAPABILITIES AND THE CBR TOOL

Scientific discovery with biomolecular data sources is an example application domain that would benefit from the research described in this paper. This domain is characterized as follows: There are alternate sources with multiple and possibly complex query capabilities. A query submitted to a mediator in this domain is typically a complex query whose evaluation involves access to multiple sources, and execution times may be in the hundreds of seconds or greater. Mediation is required, i.e., the data from these sources cannot be downloaded and stored in a warehouse. Queries must be submitted to the remote data sources that support complex query capabilities which cannot be easily replicated locally by the mediator, e.g., sequence BLAST. We refer the reader to [9, 10, 17] for details.

Understanding the queries in this example domain requires significant domain knowledge. To make our research more accessible, in this paper, we present an example of a much simpler domain. In this section, we briefly review the mediator schema, the limited query capabilities, and capability based rewriting (CBR).


2.1 An Example of Query Capability

Consider a relational mediator schema with relations Paper, CoAuthor, Reviewer and Editor. Suppose the ACM digital library (ACM DL) [26] implements these relations. We describe the limited query capability of this source as an input-output relationship ior: Input → Output, on a relation, where Input is the set of attributes that must be bound and Output is the set of projected output attributes. Each ior_i specifies a possible query that can be submitted to the remote WebSource. It is implemented as a particular WebSource Implementation (WSI). Below we use the terms ior and WSI interchangeably. A part of the limited query capability of the ACM DL is as follows:

Paper(1stAuthor, Title, PaperSrc, PaperId, Keywords)
  ior1: {1stAuthor} → {Title, PaperId, Keywords, PaperSrc}
  ior3: {1stAuthor} → {Title, PaperId, Keywords}
  ior4: {PaperId} → {PaperSrc}

CoAuthor(PaperId, CoAuthor)
  ior2: {PaperId} → {CoAuthor}

Editor(PaperId, EName)
  ior5: {PaperId} → {EName}

Reviewer(PaperId, RName)
  ior6: {} → {PaperId, RName}

To explain the limited query capability, the name of the 1stAuthor must be bound to obtain tuples from Paper, if ior1 is assigned to Paper, written Paper(ior1). Similarly, the value of PaperId must be bound to find the co-authors from CoAuthor(ior2), and the editor from Editor(ior5). The query capability described by ior1 is equivalent to the combined query capabilities of ior3 and ior4. ior1 is an atomic WSI, whereas the combination of ior3 and ior4 forms a composed WSI (ior3;ior4).
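The equivalence of the atomic WSI ior1 and the composed WSI (ior3;ior4) can be checked mechanically. The following is an illustrative sketch (not the paper's implementation): an ior is modeled as a pair of attribute sets, and composition chains the outputs of the first ior into the required inputs of the second.

```python
# Sketch: model an ior as (input attribute set, output attribute set) and
# check that the composed WSI (ior3;ior4) covers the atomic capability ior1.

def compose(ior_a, ior_b):
    """Chain two iors: outputs of the first must bind the required inputs
    of the second; the composition exposes the union of both outputs."""
    inputs_a, outputs_a = ior_a
    inputs_b, outputs_b = ior_b
    assert inputs_b <= outputs_a, "second ior's inputs must be bound by the first"
    return (inputs_a, outputs_a | outputs_b)

ior1 = ({"1stAuthor"}, {"Title", "PaperId", "Keywords", "PaperSrc"})
ior3 = ({"1stAuthor"}, {"Title", "PaperId", "Keywords"})
ior4 = ({"PaperId"}, {"PaperSrc"})

composed = compose(ior3, ior4)
print(composed == ior1)  # -> True: the composed WSI matches the atomic one
```

The capability match says nothing about cost: as discussed later, the composed WSI may issue one ior4 call per PaperId returned by ior3.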

2.2 The CBR Tool

Capability based rewriting of queries is performed by the CBR tool [41]. A query is represented as a set of subgoals on mediator relations and some attribute bindings. The CBR tool then accomplishes the following tasks: (a) It determines if a query can be accepted, i.e., there are some WebSources and WSIs that can provide an answer. (b) For each mediator subgoal in the query, it identifies all the relevant WSIs.

The problem solved by the CBR Tool is the AcceptedQuery problem [41], and several heuristic solutions to this problem and related problems are available [11, 12, 16, 23, 24, 25, 30, 39, 40, 41, 43]. While our optimizer uses capability based rewriting, this is not the focus of this paper, and we provide a brief overview of the CBR Tool in this section.

Consider the following query expressed in an SQL-like syntax:

Select Title, PaperSrc, CoAuthor
From Paper, CoAuthor, Editor, Reviewer
Where 1stAuthor = "Franklin"
  and Paper.PaperId = CoAuthor.PaperId
  and Paper.PaperId = Reviewer.PaperId
  and Paper.PaperId = Editor.PaperId

This query has four subgoals on the four mediator relations Paper, CoAuthor, Editor and Reviewer. The CBR Tool will determine that there is a partitioning of the subgoals {Paper(ior1), Reviewer(ior6)}, followed by {CoAuthor(ior2), Editor(ior5)}. There are two dependencies, Paper(ior1) → CoAuthor(ior2) and Paper(ior1) → Editor(ior5), which are imposed by the ior of the ACM DL, and which result in the partition. The dependencies exist because the attribute PaperId, which can be obtained from subgoal Paper(ior1), is a required input attribute of CoAuthor in subgoal CoAuthor(ior2) (see ior2). Similarly, PaperId is a required input of Editor in subgoal Editor(ior5) (see ior5). Thus, CoAuthor(ior2) and Editor(ior5) cannot precede Paper(ior1) in any query evaluation plan. This is a restriction on the space of plans for this query. In the optimization step, this ordering is respected and a dependent join operator [6, 11] will be introduced in the plan. Improved solutions for evaluation involving the dependent join have also been proposed in WSQ/DSQ [14].
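The partitioning above can be sketched as a simple fixpoint computation. This is our illustrative reconstruction, not the CBR Tool's actual (Prolog) algorithm; relation and attribute names follow the running example.

```python
# Sketch: repeatedly pick the subgoals whose required inputs are already
# bound, then add their output attributes to the set of bound attributes.

def partition_subgoals(subgoals, bound):
    """subgoals: dict name -> (required inputs, outputs); bound: attributes
    bound by the query. Returns the partition as a list of strata."""
    bound = set(bound)
    remaining = dict(subgoals)
    strata = []
    while remaining:
        ready = [s for s, (ins, _) in remaining.items() if ins <= bound]
        if not ready:
            raise ValueError("query cannot be accepted with these WSIs")
        strata.append(sorted(ready))
        for s in ready:
            bound |= remaining.pop(s)[1]   # outputs become available bindings
    return strata

subgoals = {
    "Paper(ior1)":    ({"1stAuthor"}, {"Title", "PaperId", "Keywords", "PaperSrc"}),
    "CoAuthor(ior2)": ({"PaperId"}, {"CoAuthor"}),
    "Editor(ior5)":   ({"PaperId"}, {"EName"}),
    "Reviewer(ior6)": (set(), {"PaperId", "RName"}),
}
print(partition_subgoals(subgoals, {"1stAuthor"}))
# -> [['Paper(ior1)', 'Reviewer(ior6)'], ['CoAuthor(ior2)', 'Editor(ior5)']]
```

The two strata match the partition derived by the CBR Tool for the example query, with the dependencies implied by which outputs bind which inputs.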

3. TWO-PHASE APPROACH OF THE WEB QUERY OPTIMIZER (WQO)

We first describe how a traditional optimizer would be extended in a straightforward manner to search among multiple WSI assignments for a good (optimal) plan. Next, we describe our two-phase approach to optimization, and compare the two approaches. To simplify the discussion, we assume that there are no dependencies in the pre-plans.

Consider an n-way join over remote mediator relations R1, ..., Rn, whose contents reside on remote WebSources. Suppose each relation Ri has m relevant WebSource Implementations Wi,1, ..., Wi,m. During traditional optimization, the optimizer must assign some WSI Wi,j to each remote relation Ri. To choose an (optimal) plan, the optimizer must consider all possible assignments Wi,1, ..., Wi,m for Ri. For each assignment, the optimizer will generate the best plan. The total number of assignments that must be considered in the example is m^n. Thus, the total optimization time is OptTime × m^n, where OptTime is the time to generate the best plan for each WSI assignment. We know that the size of the search space for obtaining the best plan is large. When a System R style optimizer only considers left-deep join trees, and when there are no dependencies, the complexity is O(n!) [32, 34, 35]. Thus, the performance of this optimizer will degrade if it must repeat the costly optimization (OptTime) for each WSI assignment.

The key idea of two-phase optimization is to separate the choice of WSI assignments to remote relations from the optimization step, and to select WSIs prior to optimization. In the first phase, a pre-optimizer is responsible for WSI assignment. The pre-optimizer uses the CBR Tool and generates a pre-plan which specifies the WSI assignment and some additional information (including partitions and dependencies). The pre-optimizer evaluates the WSI assignments of the pre-plan to identify a good choice of WSIs. In the second phase, we use an extended relational optimizer to generate the best plan for the particular choice of WSI assignments in the pre-plan. The total optimization time is PreOptTime + C × OptTime, where PreOptTime is the pre-optimization time, and C is the number of pre-plans that are evaluated in the second phase of the optimizer.
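The two cost formulas can be compared with a back-of-the-envelope calculation. All numbers below are assumed for illustration; they are not measurements from the paper.

```python
# Sketch: a traditional optimizer pays OptTime for each of the m**n WSI
# assignments, while the two-phase WQO pays PreOptTime once plus OptTime
# for only the C pre-plans it selects.

def traditional_time(m, n, opt_time):
    return opt_time * (m ** n)

def two_phase_time(pre_opt_time, c, opt_time):
    return pre_opt_time + c * opt_time

m, n = 3, 5          # assumed: 3 candidate WSIs per relation, 5-way join
opt_time = 2.0       # assumed: seconds per optimizer run
print(traditional_time(m, n, opt_time))   # 486.0 (3**5 = 243 optimizer runs)
print(two_phase_time(5.0, 2, opt_time))   # 9.0  (PreOptTime = 5s, C = 2)
```

Even with a modest number of WSIs per relation, the exponential factor m^n dominates, which is why pruning WSI assignments before optimization pays off.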

When the choice of a good WSI assignment (pre-plan) for each remote relation is clear, then it is obvious that the traditional optimizer could be easily modified to only consider good WSI choices. However, in many cases, the choice of a good WSI assignment is not simple. Suppose we consider two WSIs for a remote relation where both WSIs have similar query capability. However, the two WSIs have different end-to-end response times, as well as two different values for the cardinality of the answer that is returned. These metrics will affect the cost of the plan. Such situations are very common in the motivating biological data source application. In this case, the traditional optimizer will have to exhaustively consider all possible WSI assignments, and run the optimizer for each assignment, to choose the best plan. In contrast, during two-phase optimization, the pre-optimizer will evaluate the WSI assignments and corresponding pre-plans to select (one or more) good pre-plans using some heuristics. It will then run the optimizer to obtain a low cost plan, for this choice.

3.1 Architecture

Figure 1 presents our wrapper mediator architecture. The mediator is an extension of the Predator ORDBMS [31]; our mediator uses the relational data model. The WebSources used in our experiments include the ACM Digital Library (ACM DL) [26]. Web Wrappers [4] were built to reflect the limited capability of the sources. The Web Wrapper cost model provides relevant statistics and access costs for the WebSource. A Web Wrapper Query Broker provides interoperability between Predator (in C++) and Web Wrappers (in Java) a la CORBA. This includes finding the correct WSI for a mediator subgoal, and providing an appropriate mapping for output objects. The CBR Tool has been implemented in Quintus Prolog and has been integrated into the Predator mediator. The Predator evaluation engine was extended with several operators to implement the limited query capability of WebSources as well as adaptive operators. These include an external scan operator, a dependent join operator [6, 11], and XJOIN [36, 37].

3.2 Pre-Optimizer: Using the CBR Tool to Generate Pre-plans

For each accepted query, the CBR tool generates a set of pre-plans. A pre-plan is a data structure that provides the following information: (1) a partition and ordering of mediator subgoals; (2) the relevant WebSource Implementations (WSIs) that can evaluate each mediator subgoal (one for each subgoal); and (3) restrictions on queries to the relevant WebSources. The restrictions are (1) attributes that require bindings, and (2) attributes that must be output from the WebSource. Details of the information in the pre-plan are given in [41].

Consider the query on the ACM DL with four subgoals on the mediator relations Paper, CoAuthor, Editor and Reviewer. The dependencies imposed by Paper(ior1) and Paper(ior3;ior4) on CoAuthor(ior2) and Editor(ior5) lead to the following two pre-plans:

pre-plan 1: { { Paper(ior1), Reviewer(ior6) }, { CoAuthor(ior2), Editor(ior5) } }
  ∪ { Paper(ior1) → CoAuthor(ior2), Paper(ior1) → Editor(ior5) }

pre-plan 2: { { Paper(ior3;ior4), Reviewer(ior6) }, { CoAuthor(ior2), Editor(ior5) } }
  ∪ { Paper(ior3;ior4) → CoAuthor(ior2), Paper(ior3;ior4) → Editor(ior5) }

A pre-plan is denoted as a union of two sets. The first set identifies the partition of the subgoals and the second set identifies the dependencies.
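A minimal data structure mirroring this union-of-two-sets notation might look as follows. This is our sketch, not the WQO's code; the actual pre-plan also carries the WebSource restrictions described above.

```python
# Sketch: a pre-plan as a partition of subgoals plus a set of dependencies,
# with a helper that checks whether a join ordering respects the pre-plan.
from dataclasses import dataclass

@dataclass(frozen=True)
class PrePlan:
    partition: tuple        # tuple of strata, each a frozenset of subgoals
    dependencies: frozenset # pairs (before, after)

pre_plan_1 = PrePlan(
    partition=(frozenset({"Paper(ior1)", "Reviewer(ior6)"}),
               frozenset({"CoAuthor(ior2)", "Editor(ior5)"})),
    dependencies=frozenset({("Paper(ior1)", "CoAuthor(ior2)"),
                            ("Paper(ior1)", "Editor(ior5)")}),
)

def respects(order, pre_plan):
    """True iff the join ordering obeys every dependency of the pre-plan."""
    pos = {s: i for i, s in enumerate(order)}
    return all(pos[a] < pos[b] for a, b in pre_plan.dependencies)

print(respects(["Paper(ior1)", "Reviewer(ior6)",
                "CoAuthor(ior2)", "Editor(ior5)"], pre_plan_1))   # True
print(respects(["CoAuthor(ior2)", "Paper(ior1)",
                "Reviewer(ior6)", "Editor(ior5)"], pre_plan_1))   # False
```

The `respects` check captures exactly the restriction discussed next: the subgoal on Paper must precede CoAuthor and Editor, while Reviewer may appear anywhere.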

The pre-plans circumscribe a search space of several query evaluation plans. The join ordering of the subgoals must respect the dependencies of the pre-plan. For example, all plans in this space require that the subgoal on Paper precede the subgoals on CoAuthor and Editor. There are no restrictions on the orderings between the subgoals on CoAuthor, Editor and Reviewer.

3.3 Cost-Based Heuristics for Evaluating Pre-Plans in the Pre-Optimizer

The task of the pre-optimizer is to choose a good pre-plan that will lead to a low cost query evaluation plan. Recall that a pre-plan circumscribes a space of many query evaluation plans (plans), and each of the plans may have a different cost. Finding the least cost plan is the responsibility of the optimizer and occurs in the second phase of optimization. In the first phase, our pre-optimizer does not calculate an accurate cost for any (all) of the plans corresponding to a pre-plan. Instead, it uses some cost-based heuristics to determine the impact of some WSI assignment, and its cost metrics, on the potential cost of any (all) plans corresponding to this pre-plan. In this section, we briefly review these cost-based heuristics. In the next section, we describe a number of experiments to evaluate the effectiveness of the cost-based heuristics in choosing a good pre-plan and a good plan.

We now describe the assumptions, simplifications, and cost-based heuristics used by the pre-optimizer. Since we do not calculate the cost of the pre-plan, or the corresponding plans, we ignore all local processing costs, e.g., scans or joins of locally resident mediator relations. We focus on the impact of WSI assignments for the remote relations on the cost of plans. The impact is assessed based on the latency associated with WSIs; the number of WSI calls, including repeated calls for some WSI due to multiple values of attribute bindings; and the estimated cardinality of the results (tuples) returned from WSIs. All of these factors characterize the remote processing cost of plans. We note that while these factors are not exhaustive, they have a significant impact on the cost of the plan. We now describe the heuristics that were considered by the pre-optimizer.

• The pre-optimizer will explore specific evaluation strategies such as top-down versus bottom-up evaluation of mediator subgoals. These typically correspond to different WSI capabilities. A bottom-up evaluation is usually considered when there is a join between two subgoals, one or both of which are on remote relations, and there are no dependencies among the subgoals. A top-down (nested loop join) implementation is considered when there are dependencies between these subgoals.

We explain further using an example. Consider a join query over relations R1 ... R5. R2 is a remote relation that can be implemented by WSIs S21 and S22 respectively. The query capability of S21 is such that it requires an input binding from relation R3, whereas S22 does not have any required input bindings. Then, we obtain the following two pre-plans:

pre-plan 1: { { R1, R3, R4, R5 }, { R2(S21) } } ∪ { R3 → R2 }
pre-plan 2: { { R1, R2(S22), R3, R4, R5 } } ∪ ∅

For the second pre-plan with no dependencies, the pre-optimizer will consider the (bottom-up) cost of accessing remote relation R2 using the metrics associated with S22, and the cost of any join operators involving R2. For simplicity, we assume that the join cost is calculated based on the hash join implementation. We note that operators such as XJOIN [36, 37] or other adaptive operators [22, 21] would result in improved performance. Note that any join ordering of subgoals is possible for this pre-plan and we assume that the best ordering will be chosen by the optimizer [42].

Figure 1: Mediator Wrapper Architecture for WebSources

For the first pre-plan, the pre-optimizer will assume that the dependency corresponding to (R3 → R2) will lead to a top-down dependent join evaluation [6]. Again, for simplicity, the cost is calculated using the nested loop join implementation, using the metrics associated with S21. Adaptive operators such as those proposed in [22, 21] and WSQ/DSQ [14] would also improve performance in the actual plan. To correctly determine the cost, since R2 is the operand on the right subtree, we have to calculate the (output) cardinality of the left subtree tuples (that include relation R3) that join with R2. We assume that the optimizer will choose a good join ordering of subgoals to reduce (minimize) this cardinality [42].
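Under the stated simplifications, the remote costs of the two pre-plans can be sketched as follows. This is illustrative Python, not the pre-optimizer's code; the latencies, cardinalities, selectivity and per-tuple transfer time are invented parameters.

```python
# Sketch: bottom-up access (S22, no bindings) pays one latency plus the time
# to ship the full result; top-down dependent join (S21) pays one latency
# per distinct binding arriving from the left subtree (here, from R3).

def bottom_up_cost(latency, cardinality, per_tuple=0.01):
    return latency + cardinality * per_tuple

def top_down_cost(latency, bindings, tuples_per_binding, per_tuple=0.01):
    calls = bindings                          # one WSI call per binding value
    result = bindings * tuples_per_binding    # total tuples shipped back
    return calls * latency + result * per_tuple

# Assumed numbers: S22 returns 10,000 tuples unfiltered; S21 is called with
# 40 bindings from R3 and returns 5 tuples per binding.
print(bottom_up_cost(latency=2.0, cardinality=10_000))              # 102.0
print(top_down_cost(latency=2.0, bindings=40, tuples_per_binding=5))  # 82.0
```

With these numbers the dependent-join pre-plan wins, but a larger left-subtree cardinality would flip the outcome, which is why the pre-optimizer must estimate that cardinality.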

• The pre-optimizer will explore evaluation strategies based upon the choice of atomic versus composed WSIs. Recall the example of the ACM DL and the subgoal on Paper. The query capability ior1 (atomic WSI) is equivalent to the combined capabilities of ior3 and ior4. The composition of ior3 and ior4 defines a composed WSI (ior3;ior4). The heuristics associated with this choice are fairly complex. The composed WSI choice typically implements the equivalent query capability of the atomic WSI by using multiple wrapper calls. The number of calls in this example depends on the metrics associated with ior3, since it passes values of bindings to ior4. The cost of this solution increases with the total number of calls to ior4. The choice of a composed WSI could also result in additional joins in the mediator. Depending on the WSI capability, these additional joins are typically evaluated in a top-down manner, and this, too, has an impact on increasing the cost of the plan. However, the composed WSI (ior3;ior4) provides the ability to filter the results obtained from the remote sources, e.g., the results from the WSI corresponding to ior3 could be filtered by a predicate so as to reduce the number of calls made to ior4. This has the impact of decreasing the cost of using the composed WSI. With sufficient filtering, the composed WSI (ior3;ior4) could be less expensive than the atomic WSI ior1.

• The pre-optimizer will consider the trade-offs of the various metrics characterizing each WSI that impact the cost of the plan. For simplicity, we explain using WSIs on the same relation, and with similar query capability. Two WSIs may have different input bindings on different attributes, which filter the results. This will change the cardinality of the result, and the selectivity factor associated with the WSIs. Consider an ior7 with an input binding on both attributes 1stAuthor and Title, which is more restrictive than ior1 with a binding on 1stAuthor. This could impact the cost of the plan. The heuristic in this situation will compare the WSIs using a combined measure based on both the end-to-end latency and the result cardinality that follows from the particular selectivity of the WSI. Suppose we consider a WSI with low latency; however, its query capability is not selective and the result cardinality is high. The combined measure will trade off the benefit of the low latency of this WSI against the overhead of its high result cardinality, and rank it appropriately. This is discussed later.

While our cost-based heuristics to evaluate the assignment of WSIs in the pre-plan are useful (as shown in our experiments), there are limitations to this approach. First, several operators and techniques for adaptive query optimization, tailored to overcome delays associated with accessing remote WebSources, have been developed [2, 3, 7, 14, 18, 19, 22, 21, 38, 36]. Typically, these adaptive operators are designed to overcome delays, but obtaining cost formulas for these operators is non-trivial. The use of these operators in the actual evaluation plans should improve query execution time. In future work, we will examine the impact of delays and adaptive operators on our cost-based heuristics.


The second and more serious limitation is that while each of these cost-based heuristics can be validated independently, a typical scenario of a pre-plan will involve the evaluation of multiple WSIs, where several heuristics must be considered. In our experiments, we have examined some aspects of combining our heuristics. However, combining heuristics is typically difficult.

3.4 Extending the Relational Optimizer to Generate Safe Query Execution Plans

In the optimization phase, the WQO uses an extended randomized optimizer. This optimizer explores a search space of bushy query evaluation plans [20]. The optimizer performs random walks over the search space and picks the plan with the cheapest cost among the plans it has examined. Random walks are performed in stages. Each stage consists of an initial plan generation step followed by one or more plan transformation steps. At the beginning of each stage, a (safe) query evaluation plan (sqep) is randomly created in the plan generation step. Then, successive plan transformations are applied to the plan during the plan transformation steps, in order to obtain new plans.

The term safe reflects that both plan generation and plan transformation have to respect the ordering restrictions on mediator subgoals imposed by a pre-plan. Plans whose join orderings violate the ordering imposed by the pre-plan are not safe and must be avoided. Three methods can be used to ensure only valid join orderings: 1) generate plans and prune invalid ones; 2) deterministically fix some of the join orderings before the optimizer is activated; 3) generate only valid plans. The first approach is easy to implement; however, it becomes infeasible if most of the search space is littered with invalid plans due to complex binding dependencies. The second approach, on the other hand, destroys the randomized nature of the optimizer. The third method is more complex; however, it has the advantage of examining more valid plans in a given time. We take this approach when optimizing queries with binding dependencies.
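The third method (generating only valid plans) can be sketched as constrained random ordering: treat the pre-plan's binding dependencies as a partial order and sample a random linearization of it, so every ordering produced is safe by construction. This is our illustrative reading of the approach; the names and data structures are not from the paper:

```python
import random

def random_safe_order(subgoals, deps, rng=None):
    """Generate a random join ordering that respects binding dependencies.

    `deps` maps a subgoal to the set of subgoals that must precede it
    (i.e., that supply its input bindings). By repeatedly picking a
    random subgoal whose prerequisites are already placed, every
    ordering produced is safe, so the optimizer never wastes
    transformation steps on invalid plans.
    """
    rng = rng or random.Random()
    placed, order = set(), []
    while len(order) < len(subgoals):
        ready = [s for s in subgoals
                 if s not in placed and deps.get(s, set()) <= placed]
        choice = rng.choice(ready)
        placed.add(choice)
        order.append(choice)
    return order

# R2 requires a binding produced by R3, so R3 always precedes R2.
order = random_safe_order(["R1", "R2", "R3"], {"R2": {"R3"}})
assert order.index("R3") < order.index("R2")
```

Because the sampler only ever extends a valid prefix, it preserves the randomized character of the search while keeping it inside the safe region, unlike methods 1 and 2.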

4. EVALUATION OF THE WQO PERFORMANCE

In this section, we explore how the cost-based heuristics used in evaluating the pre-plans and their WSI assignment impact the effectiveness of the WQO. This includes the ability of the WQO to use the cost-based heuristics to efficiently navigate the search space; this is related to the optimization time. It also includes the accuracy of the cost-based heuristics in identifying low cost plans; this is related to the cost of the plan. We first discuss extensions to the catalog and the WebWrapper cost model. Next, we report on the efficiency of the WQO exploration of the search space, compared to the traditional approach. We then validate the cost-based heuristics.

4.1 Extensions to the Catalog and the WebWrapper Cost Model

The mediator catalog (PREDATOR catalog) is extended with WebSource metrics, e.g., the cardinality and the access cost of a wrapper call. The PREDATOR cost model was then extended w.r.t. these metrics. The cost of a query plan in PREDATOR is expressed in terms of various resource usages, e.g., disk usage, memory usage, etc. We extended this model and introduced a WebWrapper resource and its usage. The WebWrapper usage values are provided by the WebWrapper cost model.

The WebWrapper cost model uses statistics from query feedback to provide estimates of the cost of executing a subquery in a wrapper. The costs and statistics associated with processing a query at a remote WebSource include the time to return the first tuple, the cost of downloading a page containing relevant data, the number of tuples returned, the average size of a page containing the result, and the number of pages that contain the result. The cost model that we develop estimates the above statistics using the Web Prediction Tool (WebPT) [15]. The WebPT is a learning tool that considers parameters such as Time of Day and Day, which could affect network and WebSource workload. For sources where the above statistics vary depending on the values of particular query bindings defined for some WSI, the cost model may store statistics for each binding value. In the absence of statistics for a particular binding value, the cost formula uses the average value of each statistic.
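The per-binding bookkeeping with an average-value fallback described above can be sketched as follows. The class and method names are illustrative, not the paper's implementation:

```python
class WebWrapperStats:
    """Sketch of per-binding statistics with fallback to averages.

    Statistics (e.g., time to first tuple, result cardinality) may be
    kept per binding value; when no observations exist for a binding,
    the average over all observations is used instead.
    """
    def __init__(self):
        self.by_binding = {}   # binding value -> list of observations
        self.all_obs = []      # every observation, for the fallback

    def record(self, binding, value):
        """Record one query-feedback observation for a binding value."""
        self.by_binding.setdefault(binding, []).append(value)
        self.all_obs.append(value)

    def estimate(self, binding):
        """Per-binding average if available, else the overall average."""
        obs = self.by_binding.get(binding, self.all_obs)
        return sum(obs) / len(obs)

stats = WebWrapperStats()
stats.record("franklin", 30.0)
stats.record("zdonik", 10.0)
assert stats.estimate("franklin") == 30.0   # binding-specific estimate
assert stats.estimate("unknown") == 20.0    # average-value fallback
```

A fuller model would additionally condition on the WebPT parameters (Time of Day, Day) mentioned above; this sketch shows only the binding-value dimension.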

The PREDATOR evaluation engine was extended with several operators, including an external scan operator, a dependent join operator [6, 11] and XJoin [36]. The costs for the operators depend on metrics provided by the WebWrapper. The cost of the external scan is increased corresponding to the particular WebWrapper usage. The statistics and the WebWrapper usage are also used to determine the cost of a specific implementation of the dependent join operator [6, 11]. For example, the time for the top-down NLDJ, referred to in the next section, is based on the cost of the nested-loop join operator implemented in PREDATOR. We note that adaptive operators [22, 21] would provide improved performance.

For the experiments, the PREDATOR mediator and the WebWrappers were executed on a Sun Ultra SPARC 1 with 64MB of memory, running Solaris 2.6. The machine was connected via a 10 Mbps Ethernet cable to the domain umiacs.umd.edu. This domain is connected to its ISP via a 27 Mbps DS3 line. Experiments were conducted using the ACM DL and BLS [29] WebSources.

4.2 Efficiency of WQO Search in a Large Space with Multiple WSIs

This experiment compared the efficiency of the two-phase WQO in navigating a large search space, compared to the traditional one-phase optimizer. We measured the optimization time in an indirect manner, since the pre-optimizer and the cost-based heuristics are implemented in Prolog, whereas the relational optimizer used by the WQO and the traditional optimizer is implemented in C++. Thus, directly comparing running times would not be a fair comparison. Instead, we compared the plans that were generated by the two approaches and the (current) cost of the best plan that was generated. Both approaches used a randomized optimizer configured to traverse the same number of plan transformation steps. We used synthetic data for this comparison.

Consider a 5-way join query where all 5 relations are remote. Each was implemented by 3 WSIs. Each WSI had a different input attribute, and we chose the query so that there were no dependencies and any join order was allowed in the plans. This was done to simplify the comparison of

90

the two techniques. The end-to-end latencies and the result cardinality associated with each of the WSIs were also chosen randomly, from a range of three categories of low, medium and high values. Thus, it is clear that there is no straightforward manner in which the traditional optimizer could choose a good WSI assignment, and its only choice to obtain an (optimal) plan is to explore all the WSI assignments. In this example, there are 3^5 = 243 WSI assignments.

We now consider the WQO strategy. When choosing pre-plans in this example, since the query capabilities of the multiple WSIs are similar, the WQO strategy can be simplified to only compare WSI assignments using a cost-based heuristic that is based on a measure combining the end-to-end latency and the result cardinality of each WSI. The heuristic favors both lower latency and lower result cardinality in choosing WSIs. Thus, if the latency of two WSIs were the same, then it would favor the WSI with lower result cardinality, and vice versa. We experimented with several measures to combine the importance of latency versus cardinality, and we report on a measure that gives equal importance to both factors.
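A minimal sketch of such a combined measure is a weighted sum with equal weights, as in the reported experiment. The paper does not give its measure in closed form, so the function name, the weights, and the assumption that inputs are already on comparable (e.g., normalized per-category) scales are ours:

```python
def wsi_rank_key(latency, cardinality, w_latency=0.5, w_card=0.5):
    """Combined ranking measure for a WSI: lower is better.

    Assumes latency and cardinality have been scaled to comparable
    ranges; equal weights give equal importance to both factors,
    matching the measure reported in the experiment.
    """
    return w_latency * latency + w_card * cardinality

# With equal latencies, the lower-cardinality WSI ranks better:
assert wsi_rank_key(100, 50) < wsi_rank_key(100, 500)
# ... and with equal cardinalities, the lower-latency WSI ranks better:
assert wsi_rank_key(50, 100) < wsi_rank_key(500, 100)
```

Ranking candidate WSIs by this key and keeping the best few directly yields the small set of pre-plans the WQO passes to the relational optimizer.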

Figure 2(a) is a plot of the traditional optimizer output as it exhaustively searches the space of plans; recall that there are 3^5 WSI assignments to be considered. Each point in the plot represents the cost estimated by the optimizer of the response time (in ms) of the best plan produced by the optimizer for the particular WSI assignment being considered. As can be seen from 2(a), the time for the best plan varies widely, depending on the WSI assignment, since the traditional optimizer must randomly select WSI assignments in the plan space. Figure 2(b) represents this output as a histogram. For this example, of the 240+ plans generated, the response time of some 60 plans is less than 2 × 10^7 ms; there are more than 160 plans in the range of 4 × 10^7 to 6 × 10^7 ms, etc. This indicates that the quality of the traditional optimizer as it randomly considers WSI assignments would be poor, since many of the plans are not good plans. Finally, Figure 2(c) is a plot of the WQO optimizer, which chose up to 5 pre-plans using the cost-based heuristic just described. The cost of the best plan generated by the WQO rapidly approaches the cost of the best plan that was generated by the exhaustive search of the traditional optimizer. We note that the best plans generated by both the WQO and the traditional optimizer had response times of 1 × 10^5 ms. Thus, we conclude that the effectiveness of the WQO is high compared to the traditional optimizer. The WQO is able to efficiently search the space of plans and is able to restrict the search to a space of good WSIs which lead to reasonably low cost plans.

4.3 Choice of Top-Down Versus Bottom-Up Evaluation

An ior may require a binding of some attributes in the query, and this binding may be obtained from another subgoal. This would result in an ordering of mediator subgoals so that values are available for the bound attributes. The ordering of subgoals reflects a Sideways Information Passing (SIP) process [35], which is associated with a top-down query evaluation. Typically, this is implemented as the dependent join (DJ) operator [6, 11]. The alternative, provided there is an appropriate WSI without the corresponding attribute binding requirement, is a bottom-up evaluation, when there is no SIP, and a traditional implementation of the join can be used.

A choice of WSIs (with or without the attribute binding requirement) will lead to different evaluation strategies and will impact the quality of the plan. We briefly compare three implementations. They are the bottom-up hash join (HJ) implementation of the join operator, and two top-down evaluations of the dependent join operator, namely the nested loop (NLDJ) and a hybrid (HDJ) implementation.

The typical bottom-up evaluation of the hash join (HJ) implementation may be used when there are no binding dependencies. Both external scan operators must extract all the tuples from the WebSource, and build a hash table, before the first output tuple is produced. Each external scan makes one wrapper call. A top-down nested loop (NLDJ) evaluation is used when bindings from the outer external scan are passed by SIP to the inner external scan. The binding from the outer external scan acts as a filter and allows the inner external scan to extract only a relevant subset of tuples from the WebSource, in a tuple-at-a-time fashion. The number of wrapper calls by the inner external scan equals the cardinality of the outer scan, and as this increases, the performance of the NLDJ degrades. To overcome this drawback, we consider a hybrid (HDJ) implementation of the dependent join operator. The HDJ accepts a set of bindings (tuples) from the outer external scan and passes this set to the inner external scan, and the inner external scan passes this set to the wrapper. The wrapper then iterates over this set of bindings (tuples), and extracts only a subset of the relevant tuples. The tuple-at-a-time behavior which occurred in the mediator in the case of the NLDJ implementation now occurs in the wrapper in the HDJ implementation. The inner external scan makes one wrapper call, and this has the potential to improve performance. However, for the HDJ implementation, the mediator must pass a set of all the required bindings to the wrapper, and the wrapper must accept a set of bindings.1 We note that WSQ/DSQ [14] proposes an adaptive variant of the hybrid top-down (HDJ) implementation which exploits the potential parallelism of Web access to improve performance.

We present an experimental comparison of these evaluation strategies on the BLS WebSource [29]. The trade-offs that are observed are then analyzed using our cost-based heuristics. The mediator relations OES (Occupational Employment Statistics), OEW (Occupational Employment Wages), capabilities (IORs), and query are as follows:

OES(OESCode, OccupationTitle)
  ior8: { } → {OESCode, OccupationTitle}

OEW(OESCode, StateName, MeanWage, MedianWage, Employment, MeanAnnualWage)
  ior9: { } → {OESCode, MeanWage, MedianWage, Employment}
  ior10: {OESCode} → {MeanWage, MedianWage, Employment}

Select * From OES, OEW Where OEW.OESCode = OES.OESCode

Using ior8 and ior9, we can obtain a plan with two external scan operators on OES and OEW, respectively, and a HJ of OES and OEW over OESCode. Using ior8 and ior10, we can obtain two plans, with either a HDJ or NLDJ on the same relations, where OESCode is a binding passed from

1 We have not yet completely implemented the HDJ in our mediator so that it can handle bindings on multiple attributes.


Figure 2: Comparison of one phase optimizer (a and b) and WQO (c)

the outer (OES) to the inner (OEW). All three plans were executed multiple times at random intervals. While we report on a particular random execution, the same trend held in all executions. The dynamic nature of the environment had an impact on all execution times, as will be discussed in the next section.

To study the performance trade-off, we changed the selectivity of the outer relation OES by adding a selection condition on attribute OESCode of OES. We chose the selectivity so that, after some 700+ tuples were extracted from the WebSource and the selection was performed, a low selectivity (ls) produced four tuples and a high selectivity (hs) produced two tuples of OES. To explain these small values (two and four tuples): one of the attributes of the inner relation OEW required a costly download operation by the wrapper. Thus, we were able to observe the performance trade-off even with a small number of tuples. We also changed the cardinality of the inner relation OEW. Again, since the wrapper performed a costly download, a low cardinality (lc) corresponded to 5 tuples and a high cardinality (hc) corresponded to 10 tuples of OEW.

We consider the behavior of time-to-last-tuple (Figure 3). Comparing experiment groups labeled 1 and 2 (ls/lc and hs/lc), as the selectivity of the outer scan improved, the selection condition acted as a filter and reduced the number of tuples of OES. As a result, the performance of the top-down NLDJ evaluation improved and approached the performance of the usually more efficient bottom-up HJ. This was because the bottom-up HJ could not benefit from the filter effect on the outer scan, and could not reduce the costly download of OEW tuples from the inner relation. In contrast, the NLDJ could benefit from the filter effect and could reduce the number of OEW tuples that were downloaded. In experiment groups labeled 3 and 4 (ls/hc and hs/hc), we additionally chose a high cardinality of the costly inner scan on OEW, which further increased the tuple extraction cost of the inner scan significantly. We varied the selectivity of the outer scan from ls to hs. Again, the performance of the top-down NLDJ improved as we moved towards high selectivity, and in this case, the improvement was more noticeable. This is because the bottom-up HJ evaluation materializes the entire costly inner scan (relation) in the mediator. Thus, its performance degrades further as the cardinality of this relation increases, and thus the tuple extraction cost increases. In all cases, the HDJ (also a top-down evaluation) was the winner. It could benefit from the filter effect of the outer scan. However, unlike the bottom-up HJ, it was not penalized by the increased cardinality and cost of tuple extraction of the inner scan, since it preserves the functionality of the top-down evaluation.

Given some join query and a choice of WSIs, the following cost-based heuristics favor the choice of a WSI with top-down evaluation:

1. When the (dependent join) query has a filter that makes the selectivity of the outer scan more selective;

2. When the cardinality of the inner scan increases;

3. When one or more attributes of the inner scan involves a costly download;

4. When the optimizer is tuned to optimize the time-to-first-tuple.
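These four conditions can be sketched as a simple vote in favor of top-down evaluation. The paper treats them as qualitative heuristics rather than a formula, so the function, parameter names, and the threshold below are our illustrative assumptions:

```python
def favor_top_down(outer_filter_selective, inner_card_high,
                   inner_costly_download, optimize_first_tuple):
    """Vote on top-down (dependent-join) evaluation.

    Each argument is a boolean for one of the four conditions listed
    above; the more conditions hold, the stronger the case for a WSI
    that permits top-down evaluation. The threshold is illustrative.
    """
    votes = sum([outer_filter_selective, inner_card_high,
                 inner_costly_download, optimize_first_tuple])
    return votes >= 2

# hs/hc with a costly inner download: clear top-down case.
assert favor_top_down(True, True, True, False)
# No filter, small cheap inner scan: bottom-up HJ remains reasonable.
assert not favor_top_down(False, False, False, False)
```

The hs/hc experiment group above, where the NLDJ and HDJ beat the HJ most clearly, corresponds to the first three conditions holding simultaneously.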

However, we note that in many cases, the WebSource may not support multiple WSIs for both top-down and bottom-up evaluation. Further, the top-down evaluation is often associated with the choice of a composed WSI. Thus, the cost-based heuristics may need to consider the top-down versus bottom-up choice in conjunction with the atomic versus composed choice of WSI, and this is discussed in the next section.

4.4 WQO Choice of Atomic or Composed WSIs

The composed WSI choice typically implements the capability of the atomic WSI using multiple wrapper calls, and this results in additional joins in the mediator. Depending on the WSI capability, these additional joins are typically evaluated in a top-down manner, and this, too, has an effect on the cost of the plan. Recall that the relation Paper of ACM DL (mediator subgoal MS1) can be implemented either by a single WSI, WSI1 (ior1), or using a composition of WSI3 and WSI4 (ior3 and ior4). The WQO has a choice of a single external scan (WSI1), or a top-down dependent join (WSI3 and WSI4). We consider the following mediator query Q1:

Select Title, PaperSrc, Coauthor From Paper, CoAuthor
Where 1stAuthor="franklin" and CoAuthor.PaperId=Paper.PaperId

If the WQO chooses the atomic WSI, WSI1, then the relational optimizer will only generate plan P1 of Figure 4, which has a single dependent join. If the WQO choice is the composed WSI3 and WSI4, then the relational optimizer can produce the two plans P2 and P3 of Figure 4. Note that there are two external scan operators on the mediator relation Paper in both these plans. The second access to Paper downloaded a PDF file, and this wrapper call therefore had



Figure 3: Comparative behavior of HJ and DJ for query Q2 with different statistics

Figure 4: Plans for query Q1 in the ACM Digital Library WebSource

a higher execution cost. The download time varied from 15 seconds to as much as 70 seconds. The answer cardinality for relation Paper when Author="Franklin" was ≈ 30 tuples.

The time-to-last-tuple behavior for multiple executions of query Q1 (queries were submitted at random intervals), for each of the above plans, is shown in Figure 5(a). This figure indicates that in a volatile environment of WebSources, execution times vary significantly. This illustrates that training a traditional cost model could be very difficult. However, there are consistent trends that are observed over all executions. To show that these trends hold, we present a quantile plot of the time-to-last-tuple for all executions of Figure 5(a) in Figure 5(b).

As seen in Figure 5(b), plans P1 and P3 are statistically comparable using time-to-last-tuple. These two plans perform better than plan P2 in all cases. To explain informally, these papers had multiple co-authors. Plan P2 is expensive since it performed a costly download of the paper multiple times, once for each co-author. Thus, in this case, the WQO choice of an atomic WSI is a good conservative choice. It is conservative since it avoids the composed choice that could have resulted in the costly plan P2. The time-to-first-tuple for all the plans is not shown; however, we observed that a composed WSI in general has less initial delay, and provides smoother delivery of data. This is important in the Web environment.

Next, we consider the following query Q2 with an additional predicate CoAuthor.Coauthor="Zdonik":

Select Title, PaperSrc, Coauthor From Paper, CoAuthor
Where Author="Franklin" and CoAuthor.PaperId=Paper.PaperId
and CoAuthor.Coauthor="Zdonik"

The shape of the plans generated for this query Q2 is the same as for query Q1, with an additional selection on the Coauthor attribute of CoAuthor to be performed by the mediator. The time-to-last-tuple for Q2 is shown in Figure 6(a). Figure 6(b) has the quantile plots for the same set of executions. Plan P2, which resulted from a WQO choice of a composed implementation, is consistently the best choice. To explain informally, plan P2 first filtered for papers co-authored by "Zdonik" before performing a costly download of the paper. Thus, the WQO choice of an atomic WSI would be a poor conservative choice, since it has the penalty that it could not produce the less costly plan P2.

To derive a WQO heuristic to choose the atomic or com- posed WSI, we compare the costs of the two implementa- tions. For the atomic WSI, this is the cost of an external


Figure 5: Time-to-last-tuple (a) and quantile plot of time-to-last (b) for query Q1

Figure 6: Time-to-last-tuple (a) and quantile plot of time-to-last (b) for query Q2

scan, and for the composed WSI choice, this is the cost of a top-down evaluation. In the following table we summarize the WQO choice of best cost plan, for twenty random evaluations of queries Q1 and Q2. In the majority of the cases, the WQO chose the optimal plan of P1 for query Q1 and P2 for query Q2. The WQO chose the sub-optimal plan P3 in a few cases for the two queries. The WQO rarely chose the worst plan (P2 for Q1 and P1 for Q2).

        P1        P2        P3
Q1      16        0 (bad)   4
Q2      0 (bad)   18        2

4.5 Further Analysis of Optimizer Behavior

The two queries that were analyzed in the previous section identified several factors that may impact the cost of the plan. For query Q1, the binding attribute PaperId from the outer scan is not a key attribute that can be used to probe the tuples of the relation CoAuthor in the inner scan, and perhaps reduce the join cardinality. Thus, plan P2 was costly, since the paper was downloaded multiple times, as dictated by the join cardinality of Paper and CoAuthor. The WebWrapper cost model indicates that downloading a paper is an expensive operation. In query Q2, however, there is a selection on attribute Coauthor of relation CoAuthor, and the selectivity is high. Thus, in plan P2, performing the selection on the attribute Coauthor after the join between Paper and CoAuthor will reduce the number of papers that are downloaded. Finally, the choice of atomic or composed WSIs is also impacted by the need to typically select a top-down NLDJ with the composed choice.

These factors, join cardinality, selectivity, and the choice of a top-down evaluation, that may affect the choice of atomic or composed WSIs, were then validated on more complex queries. We performed an analysis of 3-way, 4-way and 5-way join queries, using the optimizer and the WebWrapper cost model and synthetic values for cost and cardinality for the WSIs. For all these queries, either atomic or composed WSIs could be chosen for one of the subgoals Si corresponding to relation Ri. There is a selection on one of the attributes of relation Ri, and the selectivity can be changed. There is a join between Ri and relation Rj, and the join cardinality of the result is also changed.

Figure 7 reports on the time-to-last for the 3-way join query. "Atomic" is the cost of the best plan produced by the optimizer when the atomic WSI was presented to it. "Composed" is the cost of the best plan when the composed WSIs were presented. In Figure 7(a), the maximum join cardinality for the join of (Ri, Rj) is set to 100 tuples. Reading from left to right, the selectivity on Ri decreases. Thus more tuples of Ri are selected, progressing from 1 to 10 to 100 tuples. As can be seen, when the join cardinality of the output and high selectivity combine their effect so that fewer tuples are accessed by the mediator, then the composed WSI is the winner (labeled high). However, as the selectivity decreases, or the join cardinality of the output is high, the advantage of using the composed WSI disappears (labeled low). There could also be an overhead from using the composed WSI. Figure 7(b) corresponds to a (maximum) join cardinality of 1000 tuples. The selectivity decreases from left to right, corresponding to 1, 10, 100 and 1000 tuples of Ri. We see the same behavior as before. We note that the crossover point


Figure 7: Time-to-last for a 3-way join query with varying cardinality

where the advantage of the composed WSIs disappears appears to be controlled by the cost of the wrapper usage, the cost of other parts of the query, and the choice of top-down or bottom-up evaluation chosen for the joins in the query. Results on 4-way and 5-way joins show similar trends.

The WQO should pursue an aggressive strategy of composed WSIs when the cost-based heuristics indicate that this will reduce the number of tuples delivered to the mediator. It should pursue a conservative strategy of an atomic WSI when wrapper costs are high. The following heuristic is recommended:

• WQO choice should be composed when

- The WSIs are associated with a post-selection on a relation with high selectivity.

- Some binding attribute associated with some WSI is a key attribute that can be used to probe the inner scan of some join in the query.

• WQO choice should be atomic when

- The heuristic for the composed choice is not satisfied.

- The relations associated with the WSIs have high cardinality and their joins have high join cardinality.
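The heuristic above can be sketched as a simple decision procedure. All names (`WSIContext`, `choose_wsi`) and the numeric thresholds are hypothetical illustrations; the paper states the conditions qualitatively.

```python
# Sketch of the WSI-choice heuristic. Thresholds are illustrative only.
from dataclasses import dataclass

@dataclass
class WSIContext:
    selectivity: float        # fraction of Ri's tuples selected (lower = higher selectivity)
    binding_is_key_probe: bool  # a binding attribute is a key usable to probe
                                # the inner scan of some join in the query
    relation_cardinality: int
    join_cardinality: int

HIGH_SELECTIVITY = 0.01   # illustrative threshold
HIGH_CARDINALITY = 1000   # illustrative threshold

def choose_wsi(ctx: WSIContext) -> str:
    """Return 'composed' or 'atomic' following the cost-based heuristic."""
    composed_ok = (ctx.selectivity <= HIGH_SELECTIVITY
                   or ctx.binding_is_key_probe)
    high_volume = (ctx.relation_cardinality >= HIGH_CARDINALITY
                   and ctx.join_cardinality >= HIGH_CARDINALITY)
    if composed_ok and not high_volume:
        return "composed"
    return "atomic"
```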

Finally, we further illustrate the difficulty faced by a traditional optimizer, as well as by our heuristics, in choosing WSIs.

5. COMPARISON WITH RELATED WORK AND CONCLUSIONS

Garlic [27] uses a sophisticated cost-based optimizer that considers both costs and capability, and it provides a robust model for handling diverse costs. It uses a traditional one-phase optimizer, so in order to fully explore the space of plans, the optimizer would have to enumerate all the plans. Since wrappers determine the capability of the query to the remote source, the Garlic optimizer may not consider all possible capabilities of the remote source. Our examples of atomic and composed choices indicate that such a choice is important during optimization. Garlic also does not consider dynamic costs for accessing WebSources.

Several solutions have been proposed to estimate heterogeneous access costs. [8, 13] assume that calibration databases can be constructed on remote sources, i.e., that the sources accept updates. The DISCO project [33] assumes that the

wrapper for each source provides a description of the available physical operators and their corresponding costs. The HERMES mediator [1] and the WebPT [44] are appropriate for modeling the costs of accessing WebSources, since their models use query feedback.

Finally, several operators and techniques for adaptive query optimization, tailored to overcome delays associated with accessing remote WebSources, have been developed [2, 3, 7, 18, 19, 21, 22, 36, 38]. While our cost-based heuristics have not extensively considered such adaptive techniques, it is clear that adaptive operators will have an impact. This is an area for future research.

To summarize, we presented a two-phase optimization approach in the WQO. We presented several cost-based heuristics to evaluate the choice of WSI assignments in the pre-plan, and measured the effectiveness of these heuristics experimentally. Our evaluation of the WQO indicates that in wide area environments, statistics on past query executions should be used in choosing a good plan. However, a traditional cost-based optimizer that tries to characterize the execution of some plans as always good or always poor would be problematic. We validated that our heuristics can differentiate when the choice of WSIs leads to typically good or bad plans. While each of these cost-based heuristics can be validated independently, a typical pre-plan scenario involves the evaluation of multiple WSIs, where several heuristics must be considered. In our experiments, we examined some aspects of combining our heuristics; however, combining heuristics is typically difficult. Our use of an optimizer to generate the best plan, based on the actual evaluation cost of the plan, will overcome this limitation to some extent. In future work, we will explore extending the cost model of the pre-optimizer to support a more exhaustive exploration of the impact of WSI choices on the cost of plans.

6. REFERENCES

[1] S. Adali, K.S. Candan, Y. Papakonstantinou, and V.S. Subrahmanian. Query caching and optimization in distributed mediator systems. Proceedings of the ACM SIGMOD Conference, 1996.

[2] L. Amsaleg, M. Franklin, A. Tomasic, and T. Urhan. Dynamic query operator scheduling for wide-area remote access. Journal of Distributed and Parallel Databases, 6(3), 1998.


[3] R. Avnur and J. Hellerstein. Eddies: Continuously adaptive query processing. Proceedings of the ACM SIGMOD Conference, 2000.

[4] L. Bright, J.-R. Gruser, L. Raschid, and M.E. Vidal. A wrapper generation toolkit to specify and construct wrappers for web accessible data sources (websources). Journal of Computer Systems, Special Issue on Semantics in the WWW, 14(2):83-98, 1999.

[5] L. Bright and L. Raschid. Cost modeling of wrappers for web accessible data sources (websources). http://www.umiacs.umd.edu/labs/CLIP/DARPA/ww97.html. (under review), 1998.

[6] S. Chaudhuri and K. Shim. Query optimization in the presence of foreign functions. Proceedings of the VLDB Conference, 1993.

[7] J. Chen, D. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: A scalable continuous query system for internet databases. Proceedings of the ACM SIGMOD Conference, 2000.

[8] W. Du, R. Krishnamurthy, and M.C. Shan. Query optimization in a heterogeneous DBMS. Proceedings of the VLDB Conference, 1992.

[9] B. Eckman, A. Kosky, and L. Laroco. Extending traditional query-based integration approaches for functional characterization of post-genomic data. Bioinformatics, 17(7):587-601, 2001.

[10] B. Eckman, Z. Lacroix, and L. Raschid. Optimized seamless integration of biomolecular data. Proceedings of the IEEE International Symposium on Bio-Informatics and Biomedical Engineering (BIBE), 2001.

[11] D. Florescu, A. Levy, I. Manolescu, and D. Suciu. Query optimization in the presence of limited access patterns. Proceedings of the ACM SIGMOD Conference, 1999.

[12] H. Garcia-Molina, W. Labio, and R. Yerneni. Capability-sensitive query processing on internet sources. Proceedings of the International Conference on Data Engineering, 1999.

[13] G. Gardarin, B. Finance, and P. Fankhauser. IRO-DB: A distributed system federating object and relational databases. In Object-Oriented Multidatabase Systems: A Solution for Advanced Applications, O. Bukhres and A. Elmagarmid, editors. Prentice Hall, 1996.

[14] R. Goldman and J. Widom. WSQ/DSQ: A practical approach for combined querying of databases and the web. Proceedings of the ACM SIGMOD Conference, pages 285-296, 2000.

[15] J.-R. Gruser, L. Raschid, V. Zadorozhny, and T. Zhan. Learning response time for websources using query feedback and application in query optimization. VLDB Journal, Special Issue on Databases and the Web, A. Mendelzon and P. Atzeni, editors, 9(1):18-37, 2000.

[16] L. Haas, D. Kossmann, E. Wimmers, and J. Yang. Optimizing queries across diverse data sources. Proceedings of the VLDB Conference, 1997.

[17] L. Haas, P. Schwarz, P. Kodali, E. Kotlar, J. Rice, and W. Swope. DiscoveryLink: A system for integrating life sciences data. IBM Systems Journal, 40(2), 2001.

[18] P. Haas and J. Hellerstein. Ripple joins for online aggregation. Proceedings of the ACM SIGMOD Conference, 1999.

[19] J. Hellerstein et al. Adaptive query processing: Technology in evolution. IEEE Data Engineering Bulletin, 2000.

[20] Y. Ioannidis and Y. Kang. Randomized algorithms for optimizing large join queries. Proceedings of the ACM SIGMOD Conference, 1990.

[21] Z. Ives et al. An adaptive query execution system for data integration. Proceedings of the ACM SIGMOD Conference, 1999.

[22] Z. Ives, A. Levy, D. Weld, D. Florescu, and M. Friedman. Adaptive query processing for internet applications. IEEE Data Engineering Bulletin, 23(2):19-26, 2000.

[23] A.Y. Levy et al. Querying heterogeneous information sources using source descriptions. Proceedings of VLDB, 1996.

[24] C. Li and E. Chang. Query planning with limited source capabilities. Proceedings of ICDE, 2000.

[25] C. Li and E. Chang. On answering queries in the presence of limited access patterns. Proceedings of the International Conference on Database Theory, 2001.

[26] ACM Digital Library. http://www.acm.org/dl/Search.html.

[27] M. Tork Roth, F. Ozcan, and L. Haas. Cost models do matter: Providing cost information for diverse data sources in a federated system. Proceedings of the VLDB Conference, 1999.

[28] H. Naacke, G. Gardarin, and A. Tomasic. Leveraging mediator cost models with heterogeneous data sources. Proceedings of the International Conference on Data Engineering, 1998.

[29] Bureau of Labor Statistics. http://stats.bls.gov.

[30] Y. Papakonstantinou, A. Gupta, and L. Haas. Capabilities-based query rewriting in mediator systems. Proceedings of the International Conference on Parallel and Distributed Information Systems, 1996.

[31] P. Seshadri, M. Livny, and R. Ramakrishnan. The case for enhanced abstract data types. Proceedings of the VLDB Conference, 1997.

[32] P. Selinger, M. Astrahan, D. Chamberlin, R. Lorie, and T. Price. Access path selection in a relational database management system. Proceedings of the ACM SIGMOD Conference, 1979.

[33] A. Tomasic et al. Scaling heterogeneous databases and the design of disco. Proceedings of the International Conference on Distributed Computing Systems, 1996.

[34] J. Ullman. Principles of Database and Knowledge-Base Systems, volume I. Computer Science Press, 1988.

[35] J. Ullman. Principles of Database and Knowledge-Base Systems, volume II. Computer Science Press, 1989.

[36] T. Urhan and M. Franklin. XJoin: A reactively-scheduled pipelined join operator. IEEE Data Engineering Bulletin, 23(2):27-33, 2000.

[37] T. Urhan and M. Franklin. Dynamic pipeline scheduling for improving interactive performance of online queries. Proceedings of the VLDB Conference, 2001.

[38] T. Urhan, M. Franklin, and L. Amsaleg. Cost-based query scrambling for initial delays. Proceedings of the A CM Sigmod Conference, 1998.

[39] V. Vassalos and Y. Papakonstantinou. Describing and using query capabilities of heterogeneous sources. Proceedings of the VLDB Conference, 1997.

[40] V. Vassalos and Y. Papakonstantinou. Using knowledge of redundancy for query optimization in mediators. Proceedings of the AAAI Symposium on AI and Data Integration, 1998.

[41] M.E. Vidal. A Mediator for Scaling up to Multiple WebSources. PhD thesis, Simon Bolivar University, 2000.

[42] M.E. Vidal, L. Raschid, and V. Zadorozhny. Decision support model for pre-plans. In preparation, 2001.

[43] R. Yerneni, C. Li, J. Ullman, and H. Garcia-Molina. Optimizing large join queries in mediation systems. Proceedings of the International Conference on Database Theory, 1999.

[44] V. Zadorozhny, L. Raschid, T. Zhan, and L. Bright. Validating a cost model for wide area applications. Proceedings of the International Conference on Cooperative Information Systems, 2001.
