Answering Imprecise Queries over Web Databases. Ullas Nambiar and Subbarao Kambhampati, Department of CS & Engineering, Arizona State University. VLDB, Aug 30 – Sep 02, 2005, Trondheim, Norway.


TRANSCRIPT

Page 1: Answering Imprecise Queries over Web Databases

Answering Imprecise Queries over Web Databases

Ullas Nambiar and Subbarao Kambhampati

Department of CS & Engineering, Arizona State University

VLDB, Aug 30 – Sep 02, 2005, Trondheim, Norway

Page 2

Why Imprecise Queries?

Want a ‘sedan’ priced around $7000

A Feasible Query

Make =“Toyota”, Model=“Camry”,

Price ≤ $7000

What about the price of a Honda Accord?

Is there a Camry for $7100?

Solution: Support Imprecise Queries

Make     Model   Price   Year
Toyota   Camry   $6500   1998
Toyota   Camry   $6700   2000
Toyota   Camry   $7000   2001
Toyota   Camry   $7000   1999

Page 3

The Imprecise Query Answering Problem

Problem Statement: Given a conjunctive query Q over a relation R, find a ranked set of tuples of R that satisfy Q above a threshold of similarity Tsim.

Ans(Q) = {x | x ∈ R, Similarity(Q, x) > Tsim}

Constraints:

– Autonomous database
  • Data accessible only by querying
  • Data model, operators etc. not modifiable
– Supports the boolean (relational) query model
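The answer-set definition above can be sketched as a filter-and-rank loop (a minimal illustration; the toy relation and naive `sim` measure here are placeholders, not AIMQ's learned similarity):

```python
def answer(query, relation, similarity, t_sim):
    """Ans(Q) = {x | x in R, Similarity(Q, x) > Tsim}, ranked by similarity."""
    scored = [(similarity(query, t), t) for t in relation]
    return [t for s, t in sorted(scored, key=lambda p: -p[0]) if s > t_sim]

# Toy relation and a naive similarity: fraction of matching attributes.
cars = [
    {"Make": "Toyota", "Model": "Camry", "Price": 7000},
    {"Make": "Toyota", "Model": "Corolla", "Price": 6500},
    {"Make": "Honda", "Model": "Accord", "Price": 7100},
]
sim = lambda q, t: sum(t[a] == v for a, v in q.items()) / len(q)
print(answer({"Make": "Toyota", "Model": "Camry"}, cars, sim, 0.4))
```

With threshold 0.4 the exact Camry match ranks first and the Corolla (same Make) still qualifies, while the Accord is pruned.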

Page 4

Existing Approaches

Similarity search over vector space
– Data must be stored as vectors of text
– WHIRL, W. Cohen, 1998

Enhanced database model
– Add a 'similar-to' operator to SQL; distances provided by an expert/system designer
– VAGUE, A. Motro, 1998
– Support similarity search and query refinement over abstract data types: Binderberger et al, 2003

User guidance
– Users provide information about objects required and their possible neighborhood
– Proximity Search, Goldman et al, 1998

Limitations:

1. User/expert must provide similarity measures
2. New operators needed to use distance measures
3. Not applicable over autonomous databases

Page 5

Motivation & Challenges

Objectives
– Minimal burden on the end user
– No changes to the existing database
– Domain independence

Motivation
– Mimic the relevance-based ranked-retrieval paradigm of IR systems
– Can we learn relevance statistics from the database?
– Use the estimated relevance model to improve the querying experience of users

Challenges
– Estimating query-tuple similarity
  • Weighted summation of attribute similarities
  • Syntactic similarity is inadequate; need to estimate semantic similarity
  • Not enough ontologies
– Measuring attribute importance
  • Not all attributes are equally important
  • Users cannot quantify importance

Page 6

[Architecture diagram: AIMQ. Wrappers probe Web data sources (DataSource 1…n) with random sample queries to build a sample dataset (Data Collector). In the data-processing stage, the Similarity Miner estimates value similarities and extracts concepts into a similarity matrix, while the Dependency Miner mines AFDs and approximate keys into weighted dependencies. The Query Engine maps an imprecise query to a precise one, identifies and executes similar queries, and returns ranked tuples.]

Page 7

The AIMQ approach

1. Map the imprecise query Q to a precise query: Qpr = Map(Q) (convert "like" to "=")
2. Derive the base set Abs = Qpr(R)
3. Use the base set as the set of relaxable selection queries
4. Using AFDs, find the relaxation order
5. Derive the extended set by executing the relaxed queries
6. Use concept similarity to measure tuple similarities
7. Prune tuples below the threshold
8. Return the ranked set
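The flow above can be sketched end to end (an illustrative stub, not the authors' implementation; `execute`, `relax_order` and `tuple_sim` are hypothetical stand-ins for the query interface, the AFD-derived order and the learned similarity):

```python
def aimq(query, execute, relax_order, tuple_sim, threshold):
    """Derive the base set from the precise query, grow an extended set
    by relaxing one attribute at a time, then rank and prune."""
    precise = dict(query)                 # step 1: "like" -> "="  (Qpr = Map(Q))
    extended = list(execute(precise))     # step 2: base set Abs = Qpr(R)
    for attr in relax_order:              # steps 3-5: execute relaxed queries
        relaxed = {a: v for a, v in precise.items() if a != attr}
        extended += [t for t in execute(relaxed) if t not in extended]
    scored = [(tuple_sim(precise, t), t) for t in extended]   # step 6
    return [t for s, t in sorted(scored, key=lambda p: -p[0])
            if s > threshold]             # steps 7-8: prune, return ranked set

# Toy database and helpers.
cars = [
    {"Make": "Toyota", "Model": "Camry", "Price": 7000},
    {"Make": "Toyota", "Model": "Corolla", "Price": 6500},
    {"Make": "Honda", "Model": "Accord", "Price": 7100},
]
execute = lambda q: [t for t in cars if all(t[a] == v for a, v in q.items())]
sim = lambda q, t: sum(t[a] == v for a, v in q.items()) / len(q)
print(aimq({"Make": "Toyota", "Model": "Camry"}, execute, ["Model", "Make"], sim, 0.3))
```

Relaxing Model first surfaces the Corolla as a similar answer alongside the exact Camry match.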

Page 8

Query-Tuple Similarity

Tuples in the extended set show different levels of relevance. They are ranked according to their similarity to the corresponding tuples in the base set, computed using:

– n = Count(Attributes(R)), and W_imp is the importance weight of the attribute
– Euclidean distance as similarity for numerical attributes, e.g. Price, Year
– VSim: semantic value similarity estimated by AIMQ for categorical attributes, e.g. Make, Model

Sim(Q, t) = \sum_{i=1}^{n} W_{imp}(A_i) \times
\begin{cases}
VSim(Q.A_i, t.A_i) & \text{if } Dom(A_i) = \text{Categorical} \\
1 - \frac{|Q.A_i - t.A_i|}{Q.A_i} & \text{if } Dom(A_i) = \text{Numerical}
\end{cases}
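The weighted sum can be computed as below (a sketch; the `vsim` table and importance weights here are hypothetical, standing in for the values AIMQ learns):

```python
def query_tuple_sim(query, tup, w_imp, vsim, numeric):
    """Weighted sum of per-attribute similarities: learned VSim for
    categorical attributes, 1 - |Q.Ai - t.Ai| / Q.Ai for numerical ones."""
    total = 0.0
    for a, q in query.items():
        if a in numeric:
            a_sim = 1.0 - abs(q - tup[a]) / q
        else:
            a_sim = 1.0 if q == tup[a] else vsim.get((a, q, tup[a]), 0.0)
        total += w_imp[a] * a_sim
    return total

# Hypothetical learned value similarities and importance weights.
vsim = {("Make", "Kia", "Hyundai"): 0.17}
w = {"Make": 0.6, "Price": 0.4}
s = query_tuple_sim({"Make": "Kia", "Price": 7000},
                    {"Make": "Hyundai", "Price": 6300}, w, vsim, {"Price"})
print(round(s, 3))  # 0.6 * 0.17 + 0.4 * (1 - 700/7000)
```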


Page 9

Deciding Attribute Order

– Mine AFDs and approximate keys
– Create a dependence graph using AFDs; it is strongly connected, hence a topological sort is not possible
– Using the approximate key with highest support, partition attributes into
  • Deciding set
  • Dependent set
– Sort the subsets using dependence and influence weights
– Measure attribute importance as:


CarDB(Make, Model, Year, Price)
– Decides: Make, Year
– Depends: Model, Price
– Order: Price, Model, Year, Make
– 1-attribute relaxations: {Price, Model, Year, Make}
– 2-attribute relaxations: {(Price, Model), (Price, Year), (Price, Make), .. }

• Attribute relaxation order is all non-keys first, then keys
• Greedy multi-attribute relaxation

W_{imp}(A_i) = \frac{RelaxOrder(A_i)}{count(Attributes(R))} \times \frac{Wt_{decides}(A_i)}{Wt_{decides}} \;\text{or}\; \frac{Wt_{depends}(A_i)}{Wt_{depends}}

(using the decides ratio if A_i is in the deciding set, the depends ratio if it is in the dependent set)
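The non-keys-first relaxation order and the greedy enumeration of multi-attribute relaxations can be sketched as follows (illustrative only; the input list is assumed to be pre-sorted by the dependence/influence weights described above):

```python
from itertools import combinations

def relaxation_order(attrs_by_weight, deciding_set):
    """Non-key (dependent) attributes are relaxed first, then key
    (deciding) attributes; each group keeps its weight-sorted order."""
    dependent = [a for a in attrs_by_weight if a not in deciding_set]
    deciding = [a for a in attrs_by_weight if a in deciding_set]
    return dependent + deciding

# CarDB example from the slide: Decides = {Make, Year}, Depends = {Model, Price}
order = relaxation_order(["Price", "Model", "Year", "Make"], {"Make", "Year"})
print(order)                             # relax Price, Model first, then keys
print(list(combinations(order, 2))[:3])  # first few 2-attribute relaxations
```

Enumerating `combinations` of the sorted order yields the greedy multi-attribute relaxation sequence starting with (Price, Model).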

Page 10

Empirical Evaluation

Goal
– Test robustness of the learned dependencies
– Evaluate the effectiveness of the query relaxation and similarity estimation

Database
– Used-car database CarDB based on Yahoo Autos
– CarDB(Make, Model, Year, Price, Mileage, Location, Color)
– Populated using 100k tuples from Yahoo Autos

Algorithms
– AIMQ
  • RandomRelax: randomly picks an attribute to relax
  • GuidedRelax: uses the relaxation order determined using approximate keys and AFDs
– ROCK: RObust Clustering using linKs (Guha et al, ICDE 1999)
  • Compute neighbours and links between every tuple
    Neighbour: tuples similar to each other
    Link: number of common neighbours between two tuples
  • Cluster tuples having common neighbours
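ROCK's neighbour and link computation, as summarized above, can be sketched like this (a minimal illustration with a hypothetical attribute-match similarity, not Guha et al's implementation):

```python
def neighbours(tuples, sim, theta):
    """Each tuple's neighbours: the other tuples with similarity >= theta."""
    return {i: {j for j in range(len(tuples))
                if i != j and sim(tuples[i], tuples[j]) >= theta}
            for i in range(len(tuples))}

def link(nbrs, i, j):
    """ROCK link: the number of common neighbours of tuples i and j."""
    return len(nbrs[i] & nbrs[j])

cars = [
    {"Make": "Toyota", "Model": "Camry"},
    {"Make": "Toyota", "Model": "Corolla"},
    {"Make": "Toyota", "Model": "Celica"},
    {"Make": "Honda", "Model": "Accord"},
]
sim = lambda a, b: sum(a[k] == b[k] for k in a) / len(a)
nbrs = neighbours(cars, sim, 0.5)
print(link(nbrs, 0, 1))   # Camry and Corolla share one neighbour (Celica)
```

Tuples with many common neighbours (high link counts) end up clustered together.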

Page 11

Robustness of Dependencies

[Charts: dependence of each attribute (Model, Color, Year, Make) and quality of the mined keys, for sample sizes 15k, 25k, 50k and 100k]

Attribute dependence order and key quality are unaffected by sampling.

Page 12

Robustness of Value Similarities

Value            Similar Values   25k    100k
Make="Kia"       Hyundai          0.17   0.17
                 Isuzu            0.15   0.15
                 Subaru           0.13   0.13
Make="Bronco"    Aerostar         0.19   0.21
                 F-350            0      0.12
                 Econoline Van    0.11   0.11
Year="1985"      1986             0.16   0.16
                 1984             0.13   0.14
                 1987             0.12   0.12

Page 13

Efficiency of Relaxation

[Charts: Work/Relevant Tuple over 10 test queries for ε = 0.5, 0.6 and 0.7, under random relaxation (left) and guided relaxation (right)]

Random relaxation:
• Average 8 tuples extracted per relevant tuple for ε = 0.5; increases to 120 tuples for ε = 0.7
• Not resilient to change in ε

Guided relaxation:
• Average 4 tuples extracted per relevant tuple for ε = 0.5; goes up to 12 tuples for ε = 0.7
• Resilient to change in ε

Page 14

Accuracy over CarDB

• 14 queries over 100k tuples
• Similarity learned using a 25k sample
• Mean Reciprocal Rank (MRR) estimated as

MRR(Q) = Avg\left( \frac{1}{|UserRank(t_i) - AIMQRank(t_i)| + 1} \right)

• Overall high MRR shows high relevance of suggested answers

[Chart: average MRR per query (1–14) for GuidedRelax, RandomRelax and ROCK]
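The MRR estimate above compares the rank a user assigns each answer tuple with the rank AIMQ produced; it can be computed as below (the rank values are made-up examples):

```python
def mrr(user_rank, aimq_rank):
    """MRR(Q) = Avg over answer tuples of 1 / (|UserRank(t) - AIMQRank(t)| + 1)."""
    return sum(1.0 / (abs(user_rank[t] - aimq_rank[t]) + 1)
               for t in user_rank) / len(user_rank)

user = {"t1": 1, "t2": 2, "t3": 3}     # hypothetical ranks given by a user
system = {"t1": 1, "t2": 3, "t3": 2}   # hypothetical ranks produced by AIMQ
print(mrr(user, system))               # (1 + 1/2 + 1/2) / 3
```

Perfect agreement gives MRR = 1.0; each position of disagreement lowers the score.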

Page 15

AIMQ - Summary

An approach for answering imprecise queries over Web databases
– Mines and uses AFDs to determine attribute order
– Domain-independent semantic similarity estimation technique
– Automatically computes attribute importance scores

Empirical evaluation shows
– Efficiency and robustness of the algorithms
– Better performance than current approaches
– High relevance of suggested answers
– Domain independence