


Rule Induction for Semantic Query Optimization

Chun-Nan Hsu and Craig A. Knoblock
Information Sciences Institute
Department of Computer Science
University of Southern California
4676 Admiralty Way, Marina del Rey, CA 90292
{chunnan,knoblock}@isi.edu

Abstract

Semantic query optimization can dramatically speed up database query answering by knowledge-intensive reformulation. But the problem of how to learn the required semantic rules has not previously been solved. This paper describes an approach using an inductive learning algorithm to solve the problem. In our approach, learning is triggered by user queries, and the system then induces semantic rules from the information in databases. The inductive learning algorithm used in this approach can select an appropriate set of relevant attributes from a potentially huge number of attributes in real-world databases. Experimental results demonstrate that this approach can learn sufficient background knowledge to reformulate queries and provide a 57 percent average performance improvement.

1 INTRODUCTION

Speeding up a system's performance is one of the major goals of machine learning. Explanation-based learning is typically used for speedup learning, while applications of inductive learning are usually limited to data classifiers. In this paper, we present an approach in which inductively learned knowledge is used for semantic query optimization to speed up query answering for data/knowledge-based systems.

The principle of semantic query optimization [9] is to use semantic rules, such as all Tunisian seaports have railroad access, to reformulate a query into a less expensive but equivalent query, so as to reduce the query evaluation cost. For example, suppose we have a query to find all Tunisian seaports with railroad access and 2,000,000 ft³ of storage space. From the rule given above, we can reformulate the query so that there is no need to check the railroad access of seaports, which may save some execution time.

Many algorithms for semantic query optimization have been developed [8, 9, 15, 17]. Average speedup ratios from 20 to 40 percent using hand-coded knowledge are reported in the literature. This approach to query optimization has gained increasing attention recently because it is applicable to almost all existing data/knowledge-base systems. This feature makes it particularly suitable for intelligent information servers connected to various types of remote information sources.

A critical issue of semantic query optimization is how to encode useful background knowledge for reformulation. Most of the previous work on semantic query optimization in the database community assumes that the knowledge is given. [9] proposed using semantic integrity constraints for reformulation to address the knowledge acquisition problem. Examples of semantic integrity constraints are The salary of an employee is always less than his manager's, and Only female patients can be pregnant. However, integrity rules do not reflect properties of the contents of databases, such as the relative sizes of conceptual units and the cardinality and distribution of attribute values. These properties determine the execution cost of a query. Moreover, integrity constraints rarely match query usage patterns. It is difficult to manually encode semantic rules that both reflect cost factors and match query usage patterns. The approach presented in this paper uses example queries to trigger the learning in order to match query usage patterns, and uses an inductive learning algorithm to derive rules that reflect the actual contents of databases.

An important feature of our learning approach is that the inductive algorithm learns from complex real-world information sources. In these information sources, data objects are clustered into conceptual units.

KDD-94 AAAI-94 Workshop on Knowledge Discovery in Databases Page 311

From: AAAI Technical Report WS-94-03. Compilation copyright © 1994, AAAI (www.aaai.org). All rights reserved.

For example, the conceptual unit of a relational database is a relation, or simply a table. For object-based databases, it is a class. In description-logic knowledge bases [3], data instances are clustered into concepts. Each conceptual unit has attributes that describe the relevant features. Most inductive learning systems, such as ID3, assume that the relevant attributes are given. Consider a database with three relations: car, person, and company. We might want to characterize a class of persons by the company they work for, or by the cars they drive, or by the manufacturers of their cars. In these cases, we need attributes from different relations to describe a desired class of objects. Previous studies [1, 13] have shown that the choice of instance language bias (i.e., selecting an appropriate set of attributes) is critical for the performance of an inductive learning system. To address this problem, we propose an inductive learning algorithm that can select attributes from different relations automatically.

The remainder of this paper is organized as follows. The next section illustrates the problem of semantic query optimization for data/knowledge bases. Section 3 presents an overview of the learning approach. Section 4 describes our inductive learning algorithm for structural data/knowledge bases. Section 5 shows the experimental results of using learned knowledge in reformulation. Section 6 surveys related work. Section 7 reviews the contributions of the paper and describes some future work.

2 SEMANTIC QUERY OPTIMIZATION

Semantic query optimization is applicable to various types of databases and knowledge bases. Nevertheless, we chose the relational model to describe our approach because it is widely used in practice. The approach can be easily extended to other data models. In this paper, a database consists of a set of primitive relations. A relation is then a set of instances. Each instance is a vector of attribute values. The number of attributes is fixed for all instances in a relation. The value of an attribute can be either a number or a symbol, but with a fixed type. Below is an example database with two relations and their attributes:

geoloc(name, glc_cd, country, latitude, longitude)
seaport(name, glc_cd, storage, silo, crane, rail)

where the relation geoloc stores data about geographic locations, and the attribute glc_cd is a geographic location code.

The queries we are considering here are Horn-clause queries. A query always begins with the predicate answer and has the desired information as argument variables. For example,

Q1: answer(?name) :-
      geoloc(?name, ?glc_cd, "Malta", _, _),
      seaport(_, ?glc_cd, ?storage, _, _, _),
      ?storage > 1500000.

retrieves all geographic location names in Malta. There are two types of literals. The first type corresponds to a relation stored in a database. The second type consists of built-in predicates, such as > and member. Sometimes they are referred to as extensional and intensional relations, respectively (see [20]). We do not consider negative literals and recursion in this paper.

Semantic rules for query optimization are also expressed as Horn-clause rules. Semantic rules must be consistent with the contents of a database. To clearly distinguish a rule from a query, we show queries in Prolog syntax and semantic rules in a standard logic notation. A set of example rules is shown as follows:

R1: geoloc(_, _, "Malta", ?latitude, _)
      ⇒ ?latitude ≥ 35.89.

R2: geoloc(_, ?glc_cd, "Malta", _, _)
      ⇒ seaport(_, ?glc_cd, _, _, _, _).

R3: seaport(_, ?glc_cd, ?storage, _, _, _) ∧ geoloc(_, ?glc_cd, "Malta", _, _)
      ⇒ ?storage > 2000000.

Rule R1 states that the latitude of a Maltese geographic location is greater than or equal to 35.89. R2 states that all Maltese geographic locations in the database are seaports. R3 states that all Maltese seaports have storage capacity greater than 2,000,000 ft³. Based on these rules, we can infer five equivalent queries of Q1. Three of them are:

Q21: answer(?name) :-
      geoloc(?name, ?glc_cd, "Malta", _, _),
      seaport(_, ?glc_cd, _, _, _, _).

Q22: answer(?name) :-
      geoloc(?name, _, "Malta", _, _).

Q23: answer(?name) :-
      geoloc(?name, _, "Malta", ?latitude, _),
      ?latitude ≥ 35.89.



Q21 is deduced from Q1 and R3. This is an example of constraint deletion reformulation. From R2, we can delete one more literal on seaport and infer that Q22 is also equivalent to Q1. In addition to deleting constraints, we can also add constraints to a query based on the rules. For example, we can add a constraint on ?latitude to Q22 from R1, and the resulting query Q23 is still equivalent to Q1. Sometimes the system can infer that a query is unsatisfiable because it contradicts a rule (or a chain of rules). It is also possible for the system to infer the answer directly from the rules. In both cases, there is no need to access the database to answer the query, and we can achieve nearly 100 percent savings.

Now that the system can reformulate a query into equivalent queries based on the semantic rules, the next problem is how to select the equivalent query with the lowest cost. The shortest equivalent query is not always the least expensive. The exact execution cost of a query depends on the physical implementation and the contents of the data/knowledge bases. However, we can usually estimate an approximate cost from the database schema and relation sizes. In our example, assume that the relation geoloc is very large and is sorted only on glc_cd, and assume that the relation seaport is small. Executing the shortest query Q22 requires scanning the entire geoloc relation and is thus even more expensive than executing the query Q1. The cost to evaluate Q21 will be less than that of Q1 and the other equivalent queries, because the redundant constraint on ?storage is deleted, and the system can still use the sorted attribute glc_cd to locate the answers efficiently. Therefore, the system will select Q21.
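This cost-based selection can be sketched as follows. The representation, the `estimate_cost` helper, and the statistics table are all our own illustrative assumptions (the paper gives no code): each query is encoded as a list of literals, where an unindexed literal costs a full scan of its relation and an indexed lookup costs only the matching instances.

```python
# Hypothetical sketch of cost-based selection among equivalent queries.
# Query encodings and statistics are illustrative, not from the paper.

def estimate_cost(query, stats):
    """Sum per-literal costs: a full scan costs the relation's size;
    an indexed lookup costs only the number of matching instances."""
    cost = 0
    for rel, indexed_lookup in query:
        cost += stats[rel]["matches"] if indexed_lookup else stats[rel]["size"]
    return cost

# geoloc is large and sorted only on glc_cd; seaport is small.
stats = {"geoloc": {"size": 300, "matches": 8},
         "seaport": {"size": 8, "matches": 8}}
queries = {
    "Q1":  [("geoloc", False), ("seaport", True)],   # scan geoloc, probe seaport
    "Q21": [("seaport", False), ("geoloc", True)],   # scan small seaport, probe geoloc
    "Q22": [("geoloc", False)],                      # scan geoloc alone
}
best = min(queries, key=lambda q: estimate_cost(queries[q], stats))
```

Under these illustrative statistics the shortest query Q22 costs a full geoloc scan, while Q21 comes out cheapest, matching the selection described above.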

Although the number of equivalent queries grows combinatorially with the number of applicable rules, semantic query optimization can be computed without explicitly searching this huge space. We have developed an efficient reformulation algorithm that is polynomial in the number of applicable rules. We also extended this algorithm to reformulate multidatabase query access plans and showed that the reformulations produce substantial performance improvements [8].

We conclude this section with the following observations on semantic query optimization.

1. Semantic query optimization can reduce query execution cost substantially.

2. Semantic query optimization is not a tautological transformation of the given query; it requires nontrivial, domain-specific background knowledge. Learning useful background knowledge is critical.

3. Since the execution cost of a query depends on the properties of the contents of the information sources being queried, the utility of a semantic rule also depends on these properties.

4. The overhead of reformulation is determined by the number of applicable rules. Therefore, the utility problem [10] is likely to arise, and the learning must be selective.

3 OVERVIEW OF THE LEARNING APPROACH

This section presents an overview of our learning approach to address the knowledge acquisition problem of semantic query optimization. The key idea of our learning approach is that we view a query as a logical description (a conjunction of constraints) of the answer, which is the set of instances satisfying the query. With an appropriate bias, an inductive learner can derive an equivalent query that is less expensive to evaluate than the original. Based on this idea, the learning is triggered by an example query that is expensive to evaluate. The system then inductively constructs a less expensive equivalent query from the data in the databases. Once this equivalent query is learned, the system compares the input query and the constructed query, and infers a set of semantic rules for future use.

Figure 1 illustrates a simple scenario of this learning approach. An example query is given to a small database table with 3 instances. Evaluating this query returns one instance, which is marked with a "+" sign. Conceptually, the instances in this table are labeled by the query as positive (answers) or negative (non-answers). We can use the inductive learning algorithm, with appropriate biases, to generate an equivalent alternative query that is less expensive to evaluate. The generated alternative query should be satisfied by all answer instances and none of the others. This guarantees the equivalence of the two queries with regard to the current status of the data/knowledge base. Suppose that in this simple database, a short query is always less expensive to execute. The system will bias the learning in favor of the shortest description and inductively learn an alternative query (A1 = 'Z').
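The scenario can be sketched in a few lines (the representation and helper name are ours, not the paper's): the example query labels the instances, and the learner searches single-attribute equality tests for one that covers all positives and excludes all negatives, preferring the shortest such description.

```python
# Illustrative sketch of the Figure 1 scenario; representation is hypothetical.
rows = [("A", 1, 2), ("B", 1, 2), ("Z", 0, 2)]      # attributes A1, A2, A3
query = lambda r: r[1] <= 0 and r[2] == 2           # example query: A2 <= 0 and A3 = 2

positives = [r for r in rows if query(r)]           # answers
negatives = [r for r in rows if not query(r)]       # non-answers

def shortest_alternative(positives, negatives):
    """Find a single-attribute equality test satisfied by all positives
    and by no negative instance, if one exists."""
    for i in range(len(rows[0])):
        values = {r[i] for r in positives}
        if len(values) == 1 and all(r[i] not in values for r in negatives):
            return (i, values.pop())
    return None

alt = shortest_alternative(positives, negatives)    # attribute A1, value 'Z'
```

The learner recovers the alternative query A1 = 'Z' from the labeled table alone.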



[Figure 1: A Simplified Example Learning Scenario. A database table with attributes A1, A2, A3 contains the instances (A, 1, 2), (B, 1, 2), and (Z, 0, 2). The example query labels (Z, 0, 2) as the only positive instance (answer) and the others as negative instances (non-answers). Inductive concept formation produces the alternative query A1 = 'Z', and operationalization derives the rules to be learned: (A1 = 'Z') ⇒ (A2 ≤ 0) and (A1 = 'Z') ⇒ (A3 = 2).]

The inductive learning algorithm will be discussed in the next section.

The equivalence of the alternative query and the example query provides a training example of reformulation. In other words, this training example shows which equivalent query an input query should be reformulated into. The operationalization component will deduce a set of rules from the training example. This process consists of two stages. In the first stage, the system uses a logical inference procedure to transform the training example into the required syntax (Horn clauses). This syntax is designed so that the query reformulation can be computed efficiently. The equivalence between the two queries is converted to two implication rules:

(1) (A2 ≤ 0) ∧ (A3 = 2) ⇒ (A1 = 'Z')

(2) (A1 = 'Z') ⇒ (A2 ≤ 0) ∧ (A3 = 2)

Rule (2) can be further expanded to satisfy our syntax criterion:

(3) (A1 = 'Z') ⇒ (A2 ≤ 0)

(4) (A1 = 'Z') ⇒ (A3 = 2)

After the transformation, we have proposed rules (1), (3), and (4) that satisfy our syntax criterion. In the second stage, the system tries to compress the antecedents of the rules to reduce their match costs. In our example, rules (3) and (4) contain only one literal as antecedent, so no further compression is necessary. These rules are then returned immediately and learned by the system.

If a proposed rule has more than one antecedent literal, such as rule (1), then the system can use the greedy minimum set cover algorithm [6] to eliminate unnecessary constraints. The problem of minimum set cover is to find a subset of a given collection of sets such that the union of the sets in the subset is equal to the union of all the sets. We rewrite rule (1) as

(5) ¬(A1 = 'Z') ⇒ ¬(A2 ≤ 0) ∨ ¬(A3 = 2).

The problem of compressing rule (1) is thus reduced to the following: given a collection of sets of data instances that satisfy ¬(A2 ≤ 0) ∨ ¬(A3 = 2), find the minimum number of sets that cover the set of data instances satisfying ¬(A1 = 'Z'). Since the resulting minimum cover of ¬(A1 = 'Z') is the set satisfying ¬(A2 ≤ 0), we can eliminate ¬(A3 = 2) from (5) and negate both sides to form the rule

(A2 ≤ 0) ⇒ (A1 = 'Z').
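A minimal sketch of this compression step, using the standard greedy set-cover heuristic (the data and the `greedy_set_cover` helper are illustrative, not from the paper): the universe is the set of instances satisfying the negated consequent, and each candidate set holds the instances that violate one antecedent literal.

```python
# Greedy minimum set cover, as used to prune redundant antecedent literals.
def greedy_set_cover(universe, sets):
    """Return names of a small sub-collection of `sets` covering `universe`."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # pick the set covering the most still-uncovered elements
        name = max(sets, key=lambda n: len(sets[n] & uncovered))
        if not sets[name] & uncovered:
            break  # universe not coverable
        chosen.append(name)
        uncovered -= sets[name]
    return chosen

# Hypothetical instances 1..5 satisfy not(A1 = 'Z'); each set below holds the
# instances among them that violate one antecedent literal of rule (1).
universe = {1, 2, 3, 4, 5}
sets = {"not(A2 <= 0)": {1, 2, 3, 4, 5},   # every non-answer violates A2 <= 0
        "not(A3 = 2)":  {2, 4}}            # only some violate A3 = 2
cover = greedy_set_cover(universe, sets)
```

Here the single set for ¬(A2 ≤ 0) already covers the universe, so ¬(A3 = 2) is eliminated, mirroring the compression of rule (1) above.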

4 LEARNING ALTERNATIVE QUERIES

The scenario shown in Figure 1 is a simplified example where the database consists of only one table. However, real-world databases and knowledge bases usually decompose their application domain into multiple conceptual units. One could try to combine every conceptual unit that could be relevant into one large table, and then apply the learning system for tabular databases directly. However, learning from a large table is too expensive computationally. Such an approach will not work unless a small number of relevant attributes are correctly identified before learning.

In this section, we discuss inductive learning of Horn-clause queries from a database with multiple relations. Our learning problem is to find an alternative query to characterize a class of instances defined in a relation. In standard machine learning terms, this subset of instances is labeled as the positive examples, and the others are the negative examples.

Before we discuss the algorithm, we need to clarify two forms of constraints implicitly expressed in a query. One form is an internal disjunction, a set of disjunctions on the values of an attribute. For example, an instance of geoloc satisfies:

C1: geoloc(?name, _, ?cty, _, _),
    member(?cty, ["Tunisia", "Italy", "Libya"]).

iff its ?cty value is "Tunisia", "Italy", or "Libya". The other form is a join constraint, which combines instances from two relations. For example, a pair of instances of geoloc and seaport satisfy the join constraint:

C2: geoloc(?name1, ?glc_cd, _, _, _),
    seaport(?name2, ?glc_cd, _, _, _, _).

iff they share common values on the attribute glc_cd (geographic location code).

Our inductive learning algorithm extends the greedy algorithm for learning internal disjunctions proposed by [7]. Of the many inductive learning algorithms, Haussler's was chosen because its hypothesis description language is the most similar to ours. His algorithm starts from an empty hypothesis of the target concept description to be learned. The algorithm proceeds by constructing a set of candidate constraints that are consistent with all positive examples, and then using a gain/cost ratio as the heuristic function to select and add candidates to the hypothesis. This process of candidate construction and selection is repeated until no negative instance satisfies the hypothesis.

We extended Haussler's algorithm to allow join constraints in the target description. To achieve this, we extended the candidate construction step so that join constraints are considered, and we extended the heuristic function to evaluate both internal disjunctions and join constraints. We also adopted an approach to searching the space of candidate constraints that restricts the size of the space.

4.1 CONSTRUCTING AND EVALUATING CANDIDATE CONSTRAINTS

In this subsection, we describe how to construct a candidate constraint, which can be either an internal disjunction or a join constraint. We then describe a method for evaluating both internal disjunctions and join constraints. Given a relation partitioned into positive and negative instances, we can construct an internal disjunctive constraint for each attribute by generalizing the attribute values of the positive instances. The constructed constraint is consistent with the positive instances because it is satisfied by all of them. Similarly, we can construct a join constraint consistent with the positive instances by testing whether all positive instances satisfy the join constraint. The constructed constraints are candidates to be selected by the system to form an alternative query.

For example, suppose we have a database that contains the instances shown in Figure 2. In this database, instances labeled with "+" are positive instances. Suppose the system is testing whether join constraint C2 is consistent with the positive instances. Since for every positive instance there is a corresponding instance in seaport with a common glc_cd value, the join constraint C2 is consistent and is considered a candidate constraint.
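Under our own abbreviated encoding of the Figure 2 fragment (attribute layout and variable names are ours), candidate construction and the consistency test for C2 can be sketched as:

```python
# Illustrative sketch: construct an internal disjunction from positive
# instances and test join-constraint consistency. Data abbreviates Figure 2.
geoloc = [("Safaqis", 8001, "Tunisia"), ("Valletta", 8002, "Malta"),
          ("Marsaxlokk", 8003, "Malta"), ("San Pawl", 8004, "Malta"),
          ("Marsalforn", 8005, "Malta"), ("Abano", 8006, "Italy")]
seaport_codes = {8003, 8002, 8005, 8004, 8016, 8012, 8015, 8017}

# The query labels the Maltese geographic locations as positive instances.
positives = [g for g in geoloc if g[2] == "Malta"]

# Internal disjunction on country: generalize the values seen in positives.
country_disjunction = {g[2] for g in positives}

# Join constraint C2 is a candidate iff every positive geoloc instance has a
# matching seaport instance on glc_cd.
c2_consistent = all(g[1] in seaport_codes for g in positives)
```

Every positive instance here joins with a seaport on glc_cd, so C2 survives as a candidate alongside the internal disjunction on country.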

Once we have constructed a set of candidate internal disjunctive constraints and join constraints, we need to measure which one is the most promising and add it to the hypothesis. In Haussler's algorithm, the measuring function is a gain/cost ratio, where gain is defined as the number of negative instances excluded and cost is defined as the syntactic length of a constraint. This heuristic is based on the generalized problem of minimum set cover, where each set is assigned a constant cost. Haussler used this heuristic to bias the learning toward short hypotheses. In our problem, we want the system to learn a query expression with the least cost. In real databases, additional constraints can sometimes reduce query evaluation cost. So we keep the gain part of the heuristic, while defining the cost part as the estimated evaluation cost of the constraint by a database system.

The motivation for this formula also comes from the generalized minimum set covering problem. The gain/cost heuristic has been proved to generate a set cover within a small ratio bound (ln n + 1) of the optimal set covering cost [5], where n is the number of input sets. However, in this problem, the cost of a set is a constant and the total cost of the entire set cover is the sum of the costs of the individual sets. This is not always the case for database query execution, where the cost of each constraint depends on the execution ordering. Estimating the actual cost of a constraint is very expensive. We therefore use an approximation heuristic here.

geoloc("Safaqis",    8001, Tunisia, ...)
geoloc("Valletta",   8002, Malta,   ...)+
geoloc("Marsaxlokk", 8003, Malta,   ...)+
geoloc("San Pawl",   8004, Malta,   ...)+
geoloc("Marsalforn", 8005, Malta,   ...)+
geoloc("Abano",      8006, Italy,   ...)
geoloc("Torino",     8007, Italy,   ...)
geoloc("Venezia",    8008, Italy,   ...)

seaport("Marsaxlokk",   8003, ...)
seaport("Grand Harbor", 8002, ...)
seaport("Marsa",        8005, ...)
seaport("St Pauls Bay", 8004, ...)
seaport("Catania",      8016, ...)
seaport("Palermo",      8012, ...)
seaport("Traparri",     8015, ...)
seaport("AbuKamash",    8017, ...)

Figure 2: The Database Fragment

The evaluation cost of individual constraints can be estimated using standard database query optimization techniques [21] as follows. Let D1 denote the constraining relation and |D1| denote the size of a relation; then the evaluation cost for an internal disjunctive constraint is proportional to

|D1|

because for an internal disjunction on an attribute that is not indexed, a query evaluator has to scan the entire database to find all satisfying instances. If the internal disjunction is on an indexed attribute, then the cost should be proportional to the number of instances satisfying the constraint. In both cases, the system can always sample the database query evaluator to obtain accurate execution costs.

For join constraints, let D2 denote the new relation introduced by a join constraint, and let z1, z2 denote the cardinalities of the join attributes of the two relations, that is, the numbers of distinct values of the attributes over which D1 and D2 join. Then the evaluation cost for the join over D1 and D2 is proportional to

|D1| · |D2|

when the join is over attributes that are not indexed, because the query evaluator must compute a cross product to locate pairs of satisfying instances. If the join is over indexed attributes, the evaluation cost is proportional to the number of instance pairs returned from the join, that is,

(|D1| · |D2|) / max(z1, z2)

This estimate assumes that distinct attribute values are distributed uniformly over the instances of the joined relations. Again, if possible, the system can sample the database for more accurate execution costs. For the above example problem, we have two candidate constraints that are the most promising:

C3: geoloc(?name, _, "Malta", _, _).

C4: geoloc(?name, ?glc_cd, _, _, _),
    seaport(_, ?glc_cd, _, _, _, _).

Suppose |geoloc| is 300 and |seaport| is 8. The cardinality of glc_cd for geoloc is again 300, and for seaport it is 8. Suppose both relations have indices on glc_cd. Then the evaluation cost of C3 is 300, and that of C4 is 300 * 8 / 300 = 8. The gain of C3 is 300 - 4 = 296, and the gain of C4 is 300 - 8 = 292, because only 4 instances satisfy C3 while 8 instances satisfy C4. (There are 8 seaports, and all have a corresponding geoloc instance.) So the gain/cost ratio of C3 is 296/300 = 0.98, and the gain/cost ratio of C4 is 292/8 = 36.50. The system will select C4 and add it to the hypothesis.
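The arithmetic above can be reproduced directly; the helper name and encoding below are ours, but the numbers are the paper's:

```python
# Gain/cost comparison for candidates C3 (internal disjunction) and C4 (join),
# using the sizes and cardinalities given in the running example.
total_instances = 300                        # |geoloc|

def gain_cost_ratio(satisfying, cost):
    """gain = instances excluded (negatives); cost = estimated evaluation cost."""
    gain = total_instances - satisfying
    return gain / cost

# C3: country is not the indexed attribute, so cost = |geoloc| = 300;
#     4 instances satisfy it.
# C4: join on indexed glc_cd, so cost = |geoloc| * |seaport| / max(300, 8) = 8;
#     8 instances satisfy it.
ratio_c3 = gain_cost_ratio(satisfying=4, cost=300)   # 296/300
ratio_c4 = gain_cost_ratio(satisfying=8, cost=8)     # 292/8 = 36.5
selected = "C4" if ratio_c4 > ratio_c3 else "C3"
```

Even though C3 excludes slightly more negatives, its scan cost overwhelms its gain, so C4 wins by a wide margin.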

4.2 SEARCHING THE SPACE OF CANDIDATE CONSTRAINTS

When a join constraint is selected, a new relation and its attributes are introduced into the search space of candidate constraints. The system can consider adding constraints on attributes of the newly introduced relation to the partially constructed hypothesis. In our example, a new relation seaport is introduced to describe the positive instances in geoloc. The search space is now expanded into two layers, as illustrated in Figure 3. The expanded constraints include a set of internal disjunctions on attributes of seaport, as well as join constraints from seaport to another relation. If a new join constraint has the maximum gain/cost ratio and is selected later, the search space will be expanded further. Figure 3 shows the situation when a new relation, say channel, is selected; the search space is then expanded one layer deeper. At this moment, candidate constraints will include all unselected internal disjunctions on attributes of geoloc, seaport, and channel, as well as all possible joins with new relations from geoloc, seaport, and channel. Exhaustively evaluating the gain/cost of all candidate constraints is impractical when learning from a large and complex database.

We adopt a search method that favors candidate constraints on attributes of newly introduced relations. That is, when a join constraint is selected, the system will estimate only those candidate constraints in the newly expanded layer, until the system either constructs a hypothesis that excludes all negative instances (i.e., reaches the goal) or finds no more consistent constraints with positive gain in that layer. In the latter case, the system backtracks to search the remaining constraints on previous layers. This search control bias takes advantage of underlying domain knowledge in the schema design of databases. A join constraint is unlikely to be selected on average, because an internal disjunction is usually much less expensive than a join. Once a join constraint (and thus a new relation) is selected, this is strong evidence that all useful internal disjunctions in the current layer have been selected, and it is more likely that useful candidate constraints are on attributes of the newly joined relations. This bias works well in our experiments, but there are certainly cases where this search heuristic prunes out useful candidate constraints. Another way to bias the search is to include prior knowledge for learning. In fact, it is quite natural to include prior knowledge in our algorithm, and we will discuss this later.
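The layered greedy search described above can be sketched roughly as follows. The candidate generator (`expand`) and the gain/cost estimators are hypothetical stand-ins for the system's actual components, and the demo numbers are made up for illustration:

```python
# A rough sketch of the layered greedy search over candidate constraints.

def induce(initial_relations, excludes_all_negatives, expand, gain, cost):
    """Greedily add constraints until all negative instances are excluded.

    Candidates from the newest layer (the most recently joined relation)
    are tried first; if no candidate there has positive gain, the search
    backtracks to constraints left over on earlier layers.
    """
    hypothesis = []                        # constraints selected so far
    layers = [expand(initial_relations)]   # stack of candidate layers
    while not excludes_all_negatives(hypothesis):
        for layer in reversed(layers):     # newest layer first
            usable = [c for c in layer if gain(c, hypothesis) > 0]
            if usable:
                best = max(usable, key=lambda c: gain(c, hypothesis) / cost(c))
                layer.remove(best)
                hypothesis.append(best)
                if best["kind"] == "join":
                    # A join introduces a new relation, so open a new
                    # layer of candidates on its attributes.
                    layers.append(expand([best["new_relation"]]))
                break
        else:
            break   # no positive-gain candidate anywhere: give up
    return hypothesis

# Toy demo with made-up gains and costs (not real estimates).
def _expand(relations):
    if relations == ["geoloc"]:
        return [
            {"name": "country = Malta", "kind": "internal", "g": 5.0, "c": 1.0},
            {"name": "join seaport on glc_cd", "kind": "join",
             "new_relation": "seaport", "g": 8.0, "c": 2.0},
        ]
    return []   # no further candidates in this toy example

learned = induce(["geoloc"], lambda h: len(h) >= 2, _expand,
                 lambda c, h: c["g"], lambda c: c["c"])
```

Note how the internal disjunction wins first despite a lower raw gain, because its gain/cost ratio is higher; the join is only taken afterwards, opening a new layer.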

Returning to the example, since C4 was selected, the system will expand the search space by constructing consistent internal disjunctions and join constraints on seaport. Assuming that the system cannot find any candidate on seaport with positive gain, it will backtrack to consider constraints on geoloc again. Next, the constraint on country is selected (see Figure 3) and all negative instances are excluded. The system thus learns the query:

Q3: answer(?name) :-
      geoloc(?name, ?glc_cd, "Malta", _, _),
      seaport(_, ?glc_cd, _, _, _, _).

The operationalization component will then take Q1 and this learned query Q3 as a training example for reformulation,

geoloc(?name, _, "Malta", _, _)
  ⇔ geoloc(?name, ?glc_cd, "Malta", _, _) ∧
    seaport(_, ?glc_cd, _, _, _, _).

and deduce a new rule to reformulate Q1 to Q3:

geoloc(_, ?glc_cd, "Malta", _, _)
  ⇒ seaport(_, ?glc_cd, _, _, _, _).

This is the rule R2 we have seen in Section 2. Since the size of geoloc is considerably larger than that of seaport, the next time a query asks about geographic locations in Malta, the system can reformulate the query to access the seaport relation instead and speed up the query answering process.
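As a rough illustration of this reformulation step (the literal encoding below is hypothetical, not the system's actual representation):

```python
# Sketch of applying a learned semantic rule to a query. Literals are
# (relation, args) tuples; "_" means "don't care" and "?x" is a variable.

def apply_rule(query, antecedent, consequent):
    """If a query literal matches the rule's antecedent, refine that
    literal's "_" slots with the antecedent's variables and add the
    consequent literal (e.g., an extra join) to the query."""
    prel, pargs = antecedent
    for i, (rel, args) in enumerate(query):
        if rel != prel or len(args) != len(pargs):
            continue
        # Constants in the pattern must match the literal exactly.
        if any(not (p == "_" or p.startswith("?") or a == "_" or p == a)
               for p, a in zip(pargs, args)):
            continue
        refined = tuple(a if a != "_" else p for p, a in zip(pargs, args))
        new_query = list(query)
        new_query[i] = (rel, refined)
        if consequent not in new_query:
            new_query.append(consequent)
        return new_query
    return query   # antecedent does not apply; leave the query alone

# Q1 asks for geographic locations in Malta; rule R2 adds the much
# smaller seaport relation as an equivalent filter.
q1 = [("geoloc", ("?name", "_", "Malta", "_", "_"))]
r2_lhs = ("geoloc", ("_", "?glc_cd", "Malta", "_", "_"))
r2_rhs = ("seaport", ("_", "?glc_cd", "_", "_", "_", "_"))
q3 = apply_rule(q1, r2_lhs, r2_rhs)
```

Because the rule guarantees every Malta geoloc tuple has a matching seaport tuple, adding the join preserves the answer set while letting the optimizer evaluate the smaller relation first.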

The algorithm can be further enhanced by including prior knowledge to reduce the search space. The idea is to use prior knowledge, such as the determinations proposed by [13], to sort candidate constraints by their comparative relevance, and then test their gain/cost ratios in this sorted order. For example, assuming that from its prior knowledge the system knows that constraints on the attributes latitude and longitude of geoloc are unlikely to be relevant, the system can ignore them and evaluate candidate constraints on the other attributes first. If the prior knowledge is correct, the system will construct a consistent hypothesis with irrelevant constraints pruned from the search space. However, if the system cannot find a constraint with positive gain, the prior knowledge may be wrong, and the system can backtrack to consider the "irrelevant" constraints and try to construct a hypothesis from them. In this way, the system can tolerate incorrect and incomplete prior knowledge. This usage of prior knowledge follows the general spirit of FOCL [12].
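This fallback behavior can be sketched as follows; the relevance predicate and the numbers are assumptions for illustration only:

```python
# Sketch: evaluate candidates that prior knowledge deems relevant first,
# and fall back to the "irrelevant" ones only if none has positive gain,
# so incorrect prior knowledge costs extra search time, not correctness.

def pick_constraint(candidates, gain, cost, relevant):
    preferred = [c for c in candidates if relevant(c)]
    fallback = [c for c in candidates if not relevant(c)]
    for group in (preferred, fallback):
        usable = [c for c in group if gain(c) > 0]
        if usable:
            return max(usable, key=lambda c: gain(c) / cost(c))
    return None   # nothing usable, even among the "irrelevant" candidates

# Toy example: latitude/longitude assumed irrelevant a priori.
cands = [{"attr": "country", "g": 5.0, "c": 1.0},
         {"attr": "latitude", "g": 9.0, "c": 1.0}]
relevant = lambda c: c["attr"] not in ("latitude", "longitude")
first = pick_constraint(cands, lambda c: c["g"], lambda c: c["c"], relevant)
```

Here the "relevant" country constraint is chosen even though the latitude candidate has a higher raw gain; only when no relevant candidate helps does the search reconsider the pruned ones.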

5 EXPERIMENTAL RESULTS

Our experiments are performed on the SIMS knowledge-based information server [2, 8]. SIMS allows users to access different kinds of remote databases and knowledge bases as if they were using a single system. For the purpose of our experiments, SIMS is connected with three remotely distributed Oracle databases via the Internet. Table 1 shows the contents and the sizes of these databases. We had 34 sample queries written by users of the databases for the experiments. We classified these queries into 8 categories according to the relations and constraints used in the queries. We

KDD-94 AAAI-94 Workshop on Knowledge Discovery in Databases Page 317


~name - *Valletta" V *Marsaxlokk* V...

glc_cx:l ,,, 8002 V 8003 V ..., ,

/~.~ country - "Malta" / name - "Grand Harbor" V "Marsa V St Pauls Bay’...

..ol09/~ latitude .... //A rail,=Yes~longitude .... .eaport~=//’ T

\\~ ~’~join on glc_cd with seaports C ~ storage > 2000000 /\\\,- - \\ =,,.==.z//

join on name with ?seaports~’X~"~j \’~ join on name and port_name with channel (~

\ "join on glc_cd with channel~ i

\\,

\ II join on glc_cd with geoloc \’~

t

Figure 3: Candidate Constraints to be Selected

then chose one query at random from each category as input to the learning system and generated 32 rules. These rules were used to reformulate the remaining 26 queries. In addition to the learned rules, the system also used 163 attribute range facts (e.g., the range of the storage attribute of seaport is between 0 and 100,000) compiled from the databases. Range facts are useful for numerically typed attributes in the rule matching.
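For instance, a range fact lets the reformulator refute an unsatisfiable constraint outright, with no database access. A minimal sketch, assuming a simple dictionary layout for the compiled facts (not SIMS's actual representation):

```python
# Sketch: refuting a numeric query constraint from compiled range facts.
# The range below is the example quoted in the text.

RANGE_FACTS = {("seaport", "storage"): (0, 100_000)}  # (relation, attr) -> (lo, hi)

def refuted(relation, attribute, op, value, ranges=RANGE_FACTS):
    """True when the constraint can never hold given the attribute's range,
    so the query can be answered (with zero tuples) without touching the
    database."""
    if (relation, attribute) not in ranges:
        return False                 # no range fact: cannot refute
    lo, hi = ranges[(relation, attribute)]
    if op == ">":
        return value >= hi           # no stored value exceeds the maximum
    if op == "<":
        return value <= lo
    if op == "=":
        return not (lo <= value <= hi)
    return False                     # unknown operator: be conservative
```

A query demanding storage > 2,000,000 would fall into the first group of Table 2: the learned knowledge alone infers the (empty) answer.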

The performance statistics for query reformulation are shown in Table 2. The first column shows the average performance over all tested queries. We divide the queries into 3 groups; the number of queries in each group is shown in the first row. The first group contains the unsatisfiable queries refuted by the learned knowledge. In these cases, the reformulation takes full advantage of the learned knowledge and the system does not need to access the databases at all, so we separate them from the other cases. The second group contains the low-cost queries that take less than one minute to evaluate without reformulation. The last group contains the high-cost queries.

The second row lists the average elapsed time of query execution without reformulation. The third row shows the average elapsed time of reformulation plus execution. Elapsed time is the total query processing time, from receiving a query to displaying all answers. To reduce inaccuracy due to random latency in network transmission, all elapsed time data are obtained by executing each query 10 times and computing the average. The reformulation yields significant cost reductions for high-cost queries. The overall average gain is 57.1 percent, which is better than systems using hand-coded rules for semantic optimization [8, 15, 17]. The gains are not as high for the low-cost group. This is not unexpected, because the queries in this group are already very cheap and their cost cannot be reduced much further. The average overheads listed in the table show the time in seconds used in reformulation. This overhead is very small compared to the total query processing time. On average, the system fires rules 5 times per reformulation. Note that the same rule may be fired more than once during the reformulation procedure (see [8] for a more detailed description).

6 RELATED WORK

Previously, two systems that learn background knowledge for semantic query optimization were proposed, by [18] and by [14]. Siegel's system uses predefined heuristics to drive learning from an example query. This approach is limited because the heuristics are unlikely to be comprehensive enough to detect missing rules for various queries and databases. Shekhar's system is a data-driven approach which assumes that a set of relevant attributes is given. Focusing on these relevant attributes, their system explores the contents of the database and generates a set of rules in the hope that all useful rules are learned. Siegel's system goes to one extreme by neglecting the importance of guiding the learning according to the contents of databases, while Shekhar's system goes to the other extreme by neglecting dynamic query usage patterns. Our approach is more flexible because it addresses both aspects by using


Table 1: Database Features

  Databases | Contents              | Relations | Instances | Size (MB)
  Geo       | Geographic locations  |    16     |   56708   |   10.48
  Assets    | Air and sea assets    |    14     |    5728   |    0.51
  Fmlib     | Force module library  |     8     |    3528   |    1.05

Table 2: Performance Statistics

                                  All    Answer inferred   < 60 s   >= 60 s
  # of queries                     26          4             17        5
  No reformulation (s)          54.27      44.58          10.11    212.21
  Reformulation (s)             23.28       5.45           8.79     86.78
  Time saved (s)                30.99      39.14           1.31    125.46
  % gain of total elapsed time  57.1%      87.8%          12.9%     59.1%
  Average overhead (s)           0.08       0.07           0.07      0.11
  Times rules fired              5.00       8.00           4.18      7.00

example queries to trigger the learning and using inductive learning over the contents of databases for semantic rules.

The problem of inductive learning from a database with multiple relations shares many issues with research in inductive logic programming (ILP) (Muggleton et al. 1994), especially the issue of when to introduce new relations. The main difference between our approach and ILP is that we also consider the cost of the learned concept description. Our system currently learns only single-clause, non-recursive queries, while ILP approaches can learn multi-clause and recursive rules. However, due to the complexity of the problem, most existing ILP approaches do not scale up well to learning from large, real-world data/knowledge bases containing more than ten relations with thousands of instances. Our approach can learn from large databases because it also uses the knowledge underlying the database design.

Tan's cost-sensitive learning [19] is an inductive learning algorithm that also takes the cost of the learned description into account. His algorithm tries to learn minimum-cost decision trees from examples in a robot object-recognition domain. The algorithm selects a minimum number of attributes to construct a decision tree for recognition. The attributes are selected in the order of their evaluation cost. When constructing a decision tree, it uses a heuristic attribute selection function I²/C, where I is the information gain defined as in ID3, and C is the cost to evaluate a given attribute. This function is similar to our function gain/evaluation_cost. While there is no theoretical analysis of the general performance of the heuristic I²/C for decision-tree learning, our function is derived from approximation heuristics for minimum set cover problems.

[11] defined another similar heuristic, (2^I − 1)/C, for cost-sensitive decision-tree learning. His paper provides an information-theoretic motivation for the heuristic.
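On made-up numbers, the selection heuristics just mentioned can be compared directly; under all three, a cheap, mildly informative attribute can beat an expensive, highly informative one (every value below is hypothetical, chosen only to illustrate the shared cost-sensitivity):

```python
# Hypothetical comparison of the attribute-selection heuristics discussed:
# Tan's I^2/C, Nunez's (2^I - 1)/C, and this paper's gain/cost ratio.
# I = information gain, C = evaluation cost; the numbers are made up.

def tan_score(i, c):
    return i ** 2 / c

def nunez_score(i, c):
    return (2 ** i - 1) / c

def gain_cost_score(gain, c):
    return gain / c

# A cheap, mildly informative attribute vs. a costly, very informative one.
cheap = {"i": 0.30, "gain": 12.0, "cost": 1.0}
costly = {"i": 0.90, "gain": 20.0, "cost": 10.0}
```

All three functions reward information (or cost reduction) per unit of evaluation cost, which is why the paper's gain/evaluation_cost ratio behaves analogously to the decision-tree heuristics.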

[4] present an attribute-oriented learning approach designed to learn from relational databases. The approach learns conjunctive rules by generalizing instances of a single relation. The generalization operations include replacing attribute values with their least common ancestors in a value hierarchy, removing inconsistent attributes, and removing duplicate instances. In contrast to our inductive learning algorithm, this attribute-oriented approach requires users to select relevant attributes before learning can be performed.
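A single step of this generalization can be sketched as follows; the value hierarchy and instances are hypothetical (the full method keeps climbing toward least common ancestors):

```python
# Sketch of one attribute-oriented generalization step: lift attribute
# values one level in a value hierarchy, then drop duplicate instances.

PARENT = {"Valletta": "Malta", "Marsaxlokk": "Malta", "Malta": "Europe"}

def generalize_once(values, hierarchy=PARENT):
    lifted = [hierarchy.get(v, v) for v in values]   # unknown values kept
    seen, out = set(), []
    for v in lifted:                                 # dedupe, keep order
        if v not in seen:
            seen.add(v)
            out.append(v)
    return out
```

After lifting, instances that differed only in specific values collapse into one generalized tuple, which is how the approach compresses a relation into a small set of conjunctive rules.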

The operationalization component in our learning approach can be enhanced with an EBL-like explainer to filter out low-utility rules and generalize rules. A similar "induction first, then EBL" approach can be found in [16]. Shen's system uses general heuristics to guide the inductive learning of regularities expressed in a rule template P(x, y) ∧ R(y, z) ⇒ Q(x, z). Our system has a definite goal, so we use example queries to guide the learning and do not restrict the format of learned rules to a specific template.


7 CONCLUSIONS AND FUTURE WORK

This paper demonstrates that the knowledge required for semantic query optimization can be learned inductively under the guidance of example queries. We have described a general approach in which inductive learning is triggered by example queries, and an algorithm to learn from a database with multiple relations. Experimental results show that query reformulation using learned background knowledge produces substantial cost reductions for a real-world intelligent information server.

In future work, we plan to experiment with different ways of selecting example queries for training, and to develop an effective approach to using prior knowledge for constraining search in the inductive learning algorithm. We also plan to enhance the operationalization component so that the system can be more selective and thus avoid the utility problem.

A limitation of our approach is that there is no mechanism to deal with changes to data/knowledge bases. There are three possible alternatives to address this problem. First, the system can simply remove the rules invalidated by an update and learn from future queries after the update. Second, the system can predict the expected utility of each rule and choose to update or re-learn a subset of the invalid rules. Third, the system can update or re-learn all rules after the update. We plan to experiment with all of these alternatives and propose an approach that lets the system decide which update alternative is the most appropriate for an expected model of database change.

Acknowledgements

We wish to thank Yigal Arens and Chin Y. Chee for their help on this work. Thanks also to the anonymous reviewers for their valuable comments on earlier drafts of this paper. The research reported here was supported in part by the National Science Foundation under grant No. IRI-9313993 and in part by Rome Laboratory of the Air Force Systems Command and the Advanced Research Projects Agency under Contract No. F30602-91-C-0081.

References

[1] Hussein Almuallim and Tom G. Dietterich. Learning with many irrelevant features. In Proceedings of the Ninth National Conference on Artificial Intelligence, pages 547-552, Anaheim, CA, 1991.

[2] Yigal Arens, Chin Y. Chee, Chun-Nan Hsu, and Craig A. Knoblock. Retrieving and integrating data from multiple information sources. International Journal on Intelligent and Cooperative Information Systems, 2(2):127-159, 1993.

[3] R. J. Brachman and J. G. Schmolze. An overview of the KL-ONE knowledge representation system. Cognitive Science, 9(2):171-216, 1985.

[4] Yandong Cai, Nick Cercone, and Jiawei Han. Learning in relational databases: An attribute-oriented approach. Computational Intelligence, 7(3):119-132, 1991.

[5] Vasek Chvatal. A greedy heuristic for the set covering problem. Mathematics of Operations Research, 4, 1979.

[6] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms. The MIT Press/McGraw-Hill Book Co., Cambridge, MA, 1989.

[7] David Haussler. Quantifying inductive bias: AI learning algorithms and Valiant's learning framework. Artificial Intelligence, 36:177-221, 1988.

[8] Chun-Nan Hsu and Craig A. Knoblock. Reformulating query plans for multidatabase systems. In Proceedings of the Second International Conference on Information and Knowledge Management, Washington, D.C., 1993.

[9] Jonathan J. King. Query Optimization by Semantic Reasoning. PhD thesis, Stanford University, Department of Computer Science, 1981.

[10] Steven Minton. Learning Effective Search Control Knowledge: An Explanation-Based Approach. PhD thesis, Carnegie Mellon University, School of Computer Science, 1988.

[11] Marlon Nunez. The use of background knowledge in decision tree induction. Machine Learning, 6:231-250, 1991.

[12] Michael J. Pazzani and Dennis Kibler. The utility of knowledge in inductive learning. Machine Learning, 9:57-94, 1992.

[13] Stuart J. Russell. The Use of Knowledge in Analogy and Induction. Morgan Kaufmann, San Mateo, CA, 1989.

[14] Shashi Shekhar, Babak Hamidzadeh, Ashim Kohli, and Mark Coyle. Learning transformation rules for semantic query optimization: A data-driven approach. IEEE Transactions on Knowledge and Data Engineering, 5(6):950-964, 1993.

[15] Shashi Shekhar, Jaideep Srivastava, and Soumitra Dutta. A formal model of trade-off between optimization and execution costs in semantic query optimization. In Proceedings of the 14th VLDB Conference, Los Angeles, CA, 1988.


[16] Wei-Min Shen. Discovering regularities from knowledge bases. International Journal of Intelligent Systems, 7:623-635, 1992.

[17] Sreekumar T. Shenoy and Zehra M. Ozsoyoglu. Design and implementation of a semantic query optimizer. IEEE Transactions on Knowledge and Data Engineering, 1(3):344-361, 1989.

[18] Michael D. Siegel. Automatic rule derivation for semantic query optimization. In Larry Kerschberg, editor, Proceedings of the Second International Conference on Expert Database Systems, pages 371-385. George Mason Foundation, Fairfax, VA, 1988.

[19] Ming Tan. Cost-sensitive learning of classification knowledge and its application in robotics. Machine Learning, 13:7-33, 1993.

[20] Jeffrey D. Ullman. Principles of Database and Knowledge-base Systems, volume I. Computer Science Press, Palo Alto, CA, 1988.

[21] Jeffrey D. Ullman. Principles of Database and Knowledge-base Systems, volume II. Computer Science Press, Palo Alto, CA, 1988.
