Answering Imprecise Queries over Autonomous Web Databases - IEEE 22nd International Conference on Data Engineering (ICDE'06), Atlanta, GA, USA, April 3-7, 2006

<ul><li><p>Answering Imprecise Queries over Autonomous Web Databases</p><p>Ullas Nambiar &amp; Subbarao Kambhampati</p><p>Department of Computer Science &amp; Engineering, Arizona State University, USA</p><p>E-mail: {ubnambiar, rao}@asu.edu</p><p>Abstract</p><p>Current approaches for answering queries with imprecise constraints require user-specific distance metrics and importance measures for attributes of interest - metrics that are hard to elicit from lay users. We present AIMQ, a domain- and user-independent approach for answering imprecise queries over autonomous Web databases. We developed methods for query relaxation that use approximate functional dependencies. We also present an approach to automatically estimate the similarity between values of categorical attributes. Experimental results demonstrating the robustness, efficiency and effectiveness of AIMQ are presented. Results of a preliminary user study demonstrating the high precision of the AIMQ system are also provided.</p><p>1 Introduction</p><p>Database query processing models have always assumed that the user knows what she wants and is able to formulate a query that accurately expresses her needs. But with the rapid expansion of the World Wide Web, a large number of databases like bibliographies, scientific databases etc. are becoming accessible to lay users demanding instant gratification. Often, these users may not know how to precisely express their needs and may formulate queries that lead to unsatisfactory results. Although users may not know how to phrase their queries, when presented with a mixed set of results having varying degrees of relevance to the query they can often tell which tuples are of interest to them. Thus database query processing models must embrace the IR systems' notion that the user only has vague ideas of what she wants, is unable to formulate queries capturing her needs and would prefer getting a ranked set of answers. This shift in paradigm would necessitate supporting imprecise queries. 
This sentiment is also reflected by several database researchers in a recent database research assessment [1].</p><p>Current affiliation: Dept. of Computer Science, University of California, Davis.</p><p>In this paper, we present AIMQ [19] - a domain- and user-independent solution for supporting imprecise queries over autonomous Web databases1. We will use the illustrative example below to motivate and provide an overview of our approach.</p><p>Example: Suppose a user wishes to search for sedans priced around $10000 in a used car database, CarDB(Make, Model, Year, Price, Location). Based on the database schema the user may issue the following query:</p><p>Q:- CarDB(Model = Camry, Price &lt; 10000)</p><p>On receiving the query, CarDB will provide a list of Camrys that are priced below $10000. However, given that Accord is a similar car, the user may also be interested in viewing all Accords priced around $10000. The user may also be interested in a Camry priced at $10500.</p><p>In the example above, the query processing model used by CarDB would not suggest the Accords or the slightly higher priced Camry as possible answers of interest, as the user did not specifically ask for them in her query. This will force the user to enter the tedious cycle of iteratively issuing queries for all "similar" models before she can obtain a satisfactory answer. One way to automate this is to provide the query processor information about similar models (e.g. to tell it that Accords are 0.9 similar to Camrys). 
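One way to picture this automation: given such a hand-supplied value-similarity table, a query processor could expand the Model constraint before querying. The table values below are hypothetical, and acquiring them by hand is exactly the burden AIMQ aims to remove:

```python
# Hypothetical sketch: expanding a query constraint using a hand-built
# value-similarity table. The similarity values are assumed, not from the
# paper; AIMQ's goal is to learn such similarities automatically.
MODEL_SIMILARITY = {
    ("Camry", "Accord"): 0.9,
    ("Camry", "Altima"): 0.8,
    ("Camry", "Civic"): 0.6,
}

def similar_values(value, sim_table, threshold=0.7):
    """Return all values whose similarity to `value` meets the threshold."""
    out = {value: 1.0}
    for (a, b), s in sim_table.items():
        if a == value and s >= threshold:
            out[b] = s
        elif b == value and s >= threshold:
            out[a] = s
    return out

print(similar_values("Camry", MODEL_SIMILARITY))
# → {'Camry': 1.0, 'Accord': 0.9, 'Altima': 0.8}
```

The processor could then issue one query per returned model, weighting each answer by the listed similarity.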
While such approaches have been tried, their Achilles heel has been the acquisition of the domain-specific similarity metrics - a problem that will only be exacerbated as the publicly accessible databases increase in number.</p><p>This is the motivation for the AIMQ approach: rather than shift the burden of providing the value similarity functions and attribute orders to the users, we propose a domain-independent approach for efficiently extracting and automatically ranking tuples satisfying an imprecise query over an autonomous Web database. Specifically, our intent is to mine the semantics inherently present in the tuples (as they represent real-world objects) and the structure of the relations projected by the databases. Our intent is not to take the human being out of the loop, but to considerably reduce the amount of input she has to provide to get a satisfactory answer. Specifically, we want to test how far we can go (in terms of satisfying users) by using only the information contained in the database: How closely can we model the user's notion of relevance by using only the information available in the database?</p><p>1We use the term Web database to refer to a non-local autonomous database that is accessible only via a Web (form) based interface.</p><p>Proceedings of the 22nd International Conference on Data Engineering (ICDE'06) 0-7695-2570-9/06 $20.00 2006 IEEE</p></li><li><p>Below we illustrate our proposed solution, AIMQ, and highlight the challenges raised by it. Continuing with the example given above, let the user's intended query be:</p><p>Q:- CarDB(Model like Camry, Price like 10000)</p><p>We begin by assuming that the tuples satisfying some specialization of Q, called the base query Qpr, are indicative of the answers of interest to the user. For example, it is logical to assume that a user looking for cars like Camry would be happy if shown a Camry that satisfies most of her constraints. 
Hence, we derive Qpr2 by tightening the constraints from likeliness to equality:</p><p>Qpr:- CarDB(Model = Camry, Price = 10000)</p><p>Our task then is to start with the answer tuples for Qpr, called the base set, (1) find other tuples similar to tuples in the base set and (2) rank them in terms of similarity to Q. Our idea is to consider each tuple in the base set as a (fully bound) selection query, and issue relaxations of these selection queries to the database to find additional similar tuples. For example, if one of the tuples in the base set is</p><p>Make=Toyota, Model=Camry, Price=10000, Year=2000</p><p>we can issue queries relaxing any of the attribute bindings in this tuple. This idea leads to our first challenge: Which relaxations will produce more similar tuples? Once we handle this and decide on the relaxation queries, we can issue them to the database and get additional tuples that are similar to the tuples in the base set. However, unlike the base tuples, these tuples may have varying levels of relevance to the user. They thus need to be ranked before being presented to the user. This leads to our second challenge: How to compute the similarity between the query and an answer tuple? Our problem is complicated by our interest in making this similarity judgement not depend on user-supplied distance metrics.</p><p>Contributions: In response to these challenges, we developed AIMQ - a domain- and user-independent imprecise query answering system. Given an imprecise query, AIMQ begins by deriving a precise query (called base query) that is a specialization of the imprecise query. 
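The relaxation step just described - treating each base-set tuple as a fully bound selection query and progressively dropping its bindings - can be sketched as follows. This is a simplified illustration; the relaxation order shown is hypothetical, whereas AIMQ derives it from mined attribute dependencies:

```python
def create_relaxation_queries(tuple_, relax_order):
    """Given a fully bound tuple and an attribute relaxation order,
    produce selection queries that drop the bindings one attribute at a
    time, least important attribute first (a simplification of the
    CreateQueries step in AIMQ's Algorithm 1)."""
    queries = []
    dropped = []
    for attr in relax_order:
        dropped.append(attr)
        # Keep only the bindings that have not been relaxed away yet.
        query = {a: v for a, v in tuple_.items() if a not in dropped}
        if query:
            queries.append(query)
    return queries

base_tuple = {"Make": "Toyota", "Model": "Camry", "Price": 10000, "Year": 2000}
# Hypothetical order: Year assumed least important, Model most.
for q in create_relaxation_queries(base_tuple, ["Year", "Price", "Model"]):
    print(q)
```

Each emitted query is strictly more general than the last, so issuing them in order retrieves tuples at increasing "distance" from the base tuple.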
Then to extract other relevant tuples from the database it derives a set of precise queries by considering each answer tuple of the base query as a relaxable selection query.3</p><p>2We assume a non-null resultset for Qpr or one of its generalizations. The attribute ordering heuristic we describe later in this paper is useful in relaxing Qpr also.</p><p>3The technique we use is similar to the pseudo-relevance feedback technique used in IR systems. Pseudo-relevance feedback (also known as local feedback or blind feedback) involves using the top k retrieved documents</p><p>Relaxation involves extracting tuples by identifying and executing new queries obtained by reducing the constraints on an existing query. However, randomly picking attributes to relax could generate a large number of tuples with low relevance. In theory, the tuples closest to a tuple in the base set will have differences in the attribute that least affects the binding values of other attributes. Such relationships can be captured by approximate functional dependencies (AFDs). Therefore, AIMQ makes use of AFDs between attributes to determine the degree to which a change in the value of an attribute affects other attributes. Using the mined attribute dependency information AIMQ obtains a heuristic to guide the query relaxation process. To the best of our knowledge, there is no prior work that automatically learns attribute importance measures (required for efficient query relaxation). Hence, the first contribution of AIMQ is a domain- and user-independent approach for learning attribute importance. The tuples obtained after relaxation must be ranked in terms of their similarity to the query. While we can by default use a Lp distance metric such as Euclidean distance to capture similarity between numerical values, no such widely accepted measure exists for categorical attributes. 
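One common way to quantify such an approximate dependency from sampled data is the fraction of rows that survive if, within each group sharing a left-hand-side value, only the majority right-hand-side value is kept. The sketch below uses that measure on a hypothetical CarDB sample; it is an illustration of the AFD idea, not necessarily the exact confidence measure AIMQ uses:

```python
from collections import Counter, defaultdict

def afd_confidence(rows, lhs, rhs):
    """Estimate the confidence of the approximate functional dependency
    lhs -> rhs: group rows by their lhs value, keep only the majority
    rhs value in each group, and return the fraction of rows kept."""
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[lhs]][row[rhs]] += 1
    kept = sum(c.most_common(1)[0][1] for c in groups.values())
    return kept / len(rows)

cars = [  # tiny hypothetical sample probed from CarDB
    {"Model": "Camry", "Make": "Toyota"},
    {"Model": "Camry", "Make": "Toyota"},
    {"Model": "Accord", "Make": "Honda"},
    {"Model": "Accord", "Make": "Hond"},  # a dirty row breaks the exact FD
]
print(afd_confidence(cars, "Model", "Make"))  # → 0.75
```

An attribute whose value is highly determined by the others (high incoming AFD confidence) is a cheap one to relax first, since changing it perturbs the rest of the tuple least.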
Therefore, the second contribution of this paper (and the AIMQ system) is an association based domain- and user-independent approach for estimating similarity between values binding categorical attributes.</p><p>Organization: In the next section, Section 2, we list related research efforts. An overview of our approach is given in Section 3. Section 4 explains the AFD based attribute ordering heuristic we developed. Section 5 describes our domain independent approach for estimating the similarity among values binding categorical attributes. Section 6 presents evaluation results over two real-life databases, Yahoo Autos and Census data, showing the robustness and efficiency of our algorithms and the high relevance of the suggested answers. We summarize our contributions in Section 7.</p><p>2 Related Work</p><p>Information systems based on the theory of fuzzy sets [15] were the earliest to attempt answering queries with imprecise constraints. The WHIRL [4] system provides ranked answers by converting the attribute values in the database to vectors of text and ranking them using the vector space model. In [16], Motro extends a conventional database system by adding a similar-to operator that uses distance metrics given by an expert to answer vague queries. Binderberger [20] investigates methods to extend database systems to support similarity search and query refinement over arbitrary abstract data types. In [7], Goldman et al propose to provide ranked answers to queries over Web databases but require users to provide additional guidance in deciding the similarity.</p><p>(Footnote 3, contd.) to form a new query to extract more relevant results.</p></li><li><p>Figure 1. AIMQ system architecture</p><p>
However, [20] requires changing the data models and operators of the underlying database while [7] requires the database to be represented as a graph. In contrast, our solution provides ranked results without reorganizing the underlying database and thus is easier to implement over any database. A recent work in query optimization [12] also learns approximate functional dependencies from the data but uses them to identify attribute sets for which to remember statistics. In contrast, we use them for capturing semantic patterns from the data.</p><p>The problem of answering imprecise queries is related to three other problems. They are (1) Empty answerset problem - where the given query has no answers and needs to be relaxed. In [17], Muslea focuses on solving the empty answerset problem by learning binding values and patterns likely to generate non-null resultsets. Cooperative query answering approaches have also looked at solving this problem by identifying generalizations that will return a non-null result [6]. (2) Structured query relaxation - where a query is relaxed using only the syntactical information about the query. Such an approach is often used in XML query relaxation, e.g. [21]. (3) Keyword queries in databases - Recent research efforts [2, 10] have looked at supporting keyword search style querying over databases. These approaches only return tuples containing at least one keyword appearing in the query. The results are then ranked using a notion of popularity captured by the links. The imprecise query answering problem differs from the first problem in that we are not interested in just returning some answers but those that are likely to be relevant to the user. It differs from the second and third problems as we consider semantic relaxations rather than purely syntactic ones.</p><p>3 The AIMQ approach</p><p>The AIMQ system as illustrated in Figure 1 consists of four subsystems: Data Collector, Dependency Miner, Similarity Miner and the Query Engine. 
The Data Collector probes the databases to extract sample subsets of the databases. The Dependency Miner mines AFDs and approximate keys from the probed data and uses them to determine a dependence-based importance ordering among the attributes. This ordering is used by the Query Engine for efficient query relaxation and by the Similarity Miner to assign weights to the similarity shown over each attribute during ranking. The Similarity Miner uses an association based similarity mining approach to estimate similarities between categorical values. Figure 2 shows a flow graph of our approach for answering an imprecise query.</p><p>3.1 The Problem</p><p>Given a conjunctive query Q over an autonomous Web database projecting the relation R, find all tuples of R that show similarity to Q above a threshold Tsim ∈ (0, 1). Specifically,</p><p>Ans(Q) = {x | x ∈ R, Similarity(Q, x) &gt; Tsim}</p><p>Constraints: (1) R supports the boolean query processing model (i.e. a tuple either satisfies or does not satisfy a query). (2) The answers to Q must be determined without altering the data model or requiring additional guidance from users.</p><p>3.2 Finding Relevant Answers</p><p>Imprecise Query: A user query that requires a close but not necessarily exact match is an imprecise query. Answers to such a query must be ranked according to their closeness/similarity to the query constraints. For example, the query Q:- CarDB(Make like Ford) is an imprecise query, the answers to which must have the attribute Make bound by a value similar to Ford.</p><p>Our proposed approach for answering an imprecise selection query over a database is given in Algorithm 1. Given an imprecise query Q to be executed over relation R, the threshold of similarity Tsim and the attribute relaxation order Arelax (derived using Algorithm 2 in Section 4), we begin by mapping the imprecise query Q to a precise query Qpr having a non-null answerset (Step 1). 
The set of answers for the mapped precise query forms the base set Abs. By extracting tuples having similarity above a predefined threshold, Tsim, to the tuples in Abs we can get a larger subset of potential answers called the extended set (Aes). To ensure more relevant tuples are retrieved after relaxation, we use the Al...</p></li><li><p>Figure 2. Flow graph of AIMQ's query answering approach</p><p>Algorithm 1 Finding Relevant Answers</p><p>Require: Q, R, Arelax, Tsim</p><p>1: Let Qpr = {Map(Q) | Abs = Qpr(R), |Abs| &gt; 0}</p><p>2: ∀ t ∈ Abs</p><p>3: Qrel = CreateQueries(t, Arelax)</p><p>4: ∀ q ∈ Qrel</p><p>5: Arel = q(R)</p><p>6: ∀ t′ ∈ Arel</p><p>7: if Sim(t, t′) &gt; Tsim</p><p>8: Aes = Aes ∪ t′</p><p>9: Return Top-k(Aes)</p></li></ul>
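Read end to end, the query answering loop of Algorithm 1 can be sketched as runnable code. Everything below is a simplified stand-in: the in-memory table, the relaxation order, and the overlap-based similarity are hypothetical, whereas AIMQ derives the order from mined AFDs and learns value similarities instead of hard-coding them:

```python
def overlap_sim(a, b):
    """Naive stand-in similarity: fraction of a's bindings matched by b.
    (AIMQ instead combines learned value similarities and AFD-derived
    attribute weights.)"""
    return sum(1 for k, v in a.items() if b.get(k) == v) / max(len(a), 1)

def answer_imprecise_query(db, base_query, relax_order, sim, t_sim, k=10):
    """Sketch of Algorithm 1: answer the precise base query, relax each
    base-set tuple progressively along relax_order to pull in candidates,
    keep those similar enough to a base tuple, and return the top k."""
    select = lambda q: [r for r in db if all(r.get(a) == v for a, v in q.items())]
    base_set = select(base_query)
    extended = {tuple(sorted(r.items())) for r in base_set}
    for t in base_set:
        for i in range(1, len(relax_order) + 1):
            dropped = set(relax_order[:i])
            q = {a: v for a, v in t.items() if a not in dropped}
            for cand in select(q):
                if sim(t, cand) > t_sim:
                    extended.add(tuple(sorted(cand.items())))
    ranked = sorted(extended, key=lambda r: sim(base_query, dict(r)), reverse=True)
    return [dict(r) for r in ranked[:k]]

car_db = [  # hypothetical CarDB sample
    {"Make": "Toyota", "Model": "Camry", "Price": 10000, "Year": 2000},
    {"Make": "Toyota", "Model": "Camry", "Price": 10500, "Year": 2001},
    {"Make": "Honda", "Model": "Accord", "Price": 10000, "Year": 2000},
]
answers = answer_imprecise_query(
    car_db, {"Model": "Camry", "Price": 10000},
    relax_order=["Year", "Price", "Model"], sim=overlap_sim, t_sim=0.4)
for a in answers:
    print(a)
```

With this toy data the exact match ranks first and the $10500 Camry is pulled in by relaxation, mirroring the motivating example; a richer similarity function would also surface the Accord.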

