Answering Imprecise Queries over Autonomous Web Databases

Ullas Nambiar, Dept. of Computer Science, University of California, Davis
Subbarao Kambhampati, Dept. of Computer Science, Arizona State University

5th April, ICDE 2006, Atlanta, USA
Dichotomy in Query Processing
Databases
• User knows what she wants
• User query completely expresses the need
• Answers exactly matching query constraints
IR Systems
• User has an idea of what she wants
• User query captures the need to some degree
• Answers ranked by degree of relevance
Why Support Imprecise Queries?

Want a 'sedan' priced around $7000

A feasible query:
Make = "Toyota", Model = "Camry", Price ≤ $7000

What about the price of a Honda Accord?
Is there a Camry for $7100?

Solution: Support Imprecise Queries

Make   | Model | Price | Year
Toyota | Camry | $6500 | 1998
Toyota | Camry | $6700 | 2000
Toyota | Camry | $7000 | 2001
Toyota | Camry | $7000 | 1999
………
Others are following …
What does Supporting Imprecise Queries Mean?

The Problem: Given a conjunctive query Q over a relation R, find a set of tuples that will be considered relevant by the user.

Ans(Q) = {x | x ∈ R, Relevance(Q, x) > c}

Objectives
– Minimal burden on the end user
– No changes to the existing database
– Domain independence

Motivation
– How far can we go with a relevance model estimated from the database alone?
• Tuples represent real-world objects and the relationships between them
– Use the estimated relevance model to provide a ranked set of tuples similar to the query
Challenges

Estimating query-tuple similarity
– Weighted summation of attribute similarities
– Need to estimate semantic similarity

Measuring attribute importance
– Not all attributes are equally important
– Users cannot quantify importance
Our Solution: AIMQ

The AIMQ pipeline:
Imprecise query Q
→ Map: convert "like" to "=", giving Qpr = Map(Q)
→ Query Engine: derive base set Abs = Qpr(R)
→ Dependency Miner: use the base set as a set of relaxable selection queries; using AFDs, find the relaxation order; derive the extended set by executing the relaxed queries
→ Similarity Miner: use value similarities and attribute importance to measure tuple similarities; prune tuples below a threshold
→ Return the ranked set
An Illustrative Example
Relation: CarDB(Make, Model, Price, Year)

Imprecise query
Q :− CarDB(Model like "Camry", Price like "10k")

Base query
Qpr :− CarDB(Model = "Camry", Price = "10k")

Base set Abs
Make = "Toyota", Model = "Camry", Price = "10k", Year = "2000"
Make = "Toyota", Model = "Camry", Price = "10k", Year = "2001"
Obtaining Extended Set
Problem: Given base set, find tuples from database similar to tuples in base set.
Solution:
– Consider each tuple in the base set as a selection query.
e.g. Make = "Toyota", Model = "Camry", Price = "10k", Year = "2000"
– Relax each such query to obtain "similar" precise queries.
e.g. Make = "Toyota", Model = "Camry", Price = "", Year = "2000"
– Execute and determine tuples having similarity above some threshold.
Challenge: Which attribute should be relaxed first?
– Make ? Model ? Price ? Year ?
Solution: Relax least important attribute first.
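A minimal sketch of this step: each base-set tuple is treated as a selection query, and bindings are dropped one at a time, least important attribute first. The helper name and the hard-coded relaxation order below are illustrative; AIMQ derives the order from mined AFDs.

```python
# Sketch: relax a base-set tuple-query by dropping attribute bindings,
# least important attribute first. The order is an assumed input here.
def relaxed_queries(tuple_query, relax_order):
    """Yield progressively relaxed selection queries."""
    query = dict(tuple_query)
    for attr in relax_order:
        query.pop(attr)          # relax: remove this attribute's binding
        if query:
            yield dict(query)

base_tuple = {"Make": "Toyota", "Model": "Camry",
              "Price": "10k", "Year": "2000"}
order = ["Price", "Model", "Year", "Make"]   # assumed: least important first

for q in relaxed_queries(base_tuple, order):
    print(q)
```

Each yielded query is then executed against the database, and the returned tuples are kept only if their similarity to the base tuple exceeds the threshold.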
Least Important Attribute

Definition: An attribute whose binding value, when changed, has minimal effect on the values binding the other attributes.
– Does not decide the values of other attributes
– Its value may depend on other attributes
E.g. changing/relaxing Price will usually not affect other attributes, but changing Model usually affects Price.

Deciding relative importance requires knowing the dependence between attributes
– Attribute dependence information is not provided by the sources
– Learn it using Approximate Functional Dependencies & Approximate Keys

• Approximate Functional Dependency (AFD)
X → A is an FD over r′, r′ ⊆ r.
If error(X → A) = |r − r′| / |r| < 1, then X → A is an AFD over r.
• Approximate in the sense that the dependency is obeyed by a large percentage (but not all) of the tuples in the database
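The AFD error above can be computed directly: within each group of tuples sharing the same X values, every tuple that disagrees with the majority A value must be removed for X → A to hold exactly. A minimal sketch (the toy relation and function name are illustrative):

```python
from collections import Counter, defaultdict

# Sketch: AFD error of X -> A over a relation given as a list of dicts.
# error = minimum fraction of tuples to delete so X -> A holds exactly.
def afd_error(rows, X, A):
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[x] for x in X)
        groups[key][row[A]] += 1
    # Keep the majority A value in each X-group; the rest violate the FD.
    kept = sum(counts.most_common(1)[0][1] for counts in groups.values())
    return 1 - kept / len(rows)

cars = [
    {"Make": "Toyota", "Model": "Camry"},
    {"Make": "Toyota", "Model": "Camry"},
    {"Make": "Toyota", "Model": "Corolla"},
    {"Make": "Honda",  "Model": "Accord"},
]
# Model -> Make holds exactly; Make -> Model fails for 1 of 4 tuples.
print(afd_error(cars, ["Model"], "Make"))   # 0.0
print(afd_error(cars, ["Make"], "Model"))   # 0.25
```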
TANE, an algorithm by Huhtala et al. [1999], is used to mine the AFDs and approximate keys
• Exponential in the number of attributes
• Linear in the number of tuples
Deciding Attribute Importance

Mine AFDs and approximate keys
Create a dependence graph using the AFDs
– Strongly connected, hence a topological sort is not possible
Using the approximate key with highest support, partition the attributes into
– Deciding set
– Dependent set
Sort the subsets using dependence and influence weights
Measure attribute importance as:
Example: CarDB(Make, Model, Year, Price)
Decides: Make, Year
Depends: Model, Price
Order: Price, Model, Year, Make
1-attribute relaxations: {Price, Model, Year, Make}
2-attribute relaxations: {(Price, Model), (Price, Year), (Price, Make), …}
• Attribute relaxation order is all non-keys first, then keys
• Greedy multi-attribute relaxation
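The greedy enumeration of relaxation subsets sketched above (increasing subset size, less important attributes first within each size) can be written compactly; the attribute order is an assumed input, in practice derived from the mined approximate keys and AFDs:

```python
from itertools import combinations

# Sketch: enumerate attribute subsets to relax, in greedy order:
# smaller subsets first, and within a size, combinations of the
# least important attributes first.
def relaxation_subsets(order, max_size=2):
    """order lists attributes least important first."""
    for size in range(1, max_size + 1):
        for combo in combinations(order, size):
            yield combo

order = ["Price", "Model", "Year", "Make"]   # non-keys first, then keys
print(list(relaxation_subsets(order)))
```

Because `itertools.combinations` preserves input order, this reproduces the slide's sequence: Price, Model, Year, Make, then (Price, Model), (Price, Year), (Price, Make), and so on.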
Wimp(Ai) = RelaxOrder(Ai) / count(Attributes(R)) × Wtdecides(Ai) (for deciding attributes) or Wtdepends(Ai) (for dependent attributes)
Query-Tuple Similarity

Tuples in the extended set show different levels of relevance. They are ranked according to their similarity to the corresponding tuples in the base set using

Sim(Q, t) = Σ(i = 1..n) Wimp(Ai) ×
  VSim(Q.Ai, t.Ai)              if Dom(Ai) is categorical
  1 − |Q.Ai − t.Ai| / Q.Ai      if Dom(Ai) is numerical

– n = count(Attributes(R)) and Wimp is the importance weight of the attribute
– Euclidean distance as similarity for numerical attributes, e.g. Price, Year
– VSim: semantic value similarity estimated by AIMQ for categorical attributes, e.g. Make, Model
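A minimal sketch of this weighted query-tuple similarity; the importance weights and the VSim stand-in below are hypothetical placeholders, not the values AIMQ learns:

```python
# Sketch: weighted query-tuple similarity over categorical and
# numerical attributes. w_imp and vsim are assumed inputs.
def tuple_similarity(query, t, w_imp, vsim, numerical):
    sim = 0.0
    for attr, q_val in query.items():
        if attr in numerical:
            # distance-based similarity for numeric attributes
            sim += w_imp[attr] * (1 - abs(q_val - t[attr]) / q_val)
        else:
            sim += w_imp[attr] * vsim(q_val, t[attr])
    return sim

w_imp = {"Model": 0.4, "Price": 0.6}          # assumed weights
vsim = lambda a, b: 1.0 if a == b else 0.3    # stand-in for learned VSim
q = {"Model": "Camry", "Price": 10000}
t = {"Model": "Corolla", "Price": 9000}
print(tuple_similarity(q, t, w_imp, vsim, numerical={"Price"}))  # 0.66
```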
Categorical Value Similarity

Two words are semantically similar if they have a common context (an idea from NLP).
The context of a value is represented as a set of bags of co-occurring values, called a supertuple.
Value similarity: estimated as the percentage of common {Attribute, Value} pairs
– Measured as the Jaccard similarity among the supertuples representing the values

Supertuple for concept Make = "Toyota", ST(QMake=Toyota):
Model: Camry: 3, Corolla: 4, …
Year:  2000: 6, 1999: 5, 2001: 2, …
Price: 5995: 4, 6500: 3, 4000: 6

JaccardSim(A, B) = |A ∩ B| / |A ∪ B|

VSim(v1, v2) = Σ(i = 1..m) Wimp(Ai) × JaccardSim(ST(v1).Ai, ST(v2).Ai)
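Since supertuples are bags, the Jaccard measure can be taken over multisets, which `collections.Counter` supports directly via `&` and `|`. A minimal sketch, with made-up supertuples and assumed weights:

```python
from collections import Counter

# Sketch: attribute-weighted Jaccard similarity between the bag-valued
# supertuples of two categorical values. Supertuples and weights are
# assumed inputs; AIMQ builds supertuples from co-occurring values.
def jaccard(a: Counter, b: Counter) -> float:
    inter = sum((a & b).values())   # multiset intersection size
    union = sum((a | b).values())   # multiset union size
    return inter / union if union else 0.0

def vsim(st1, st2, w_imp):
    return sum(w_imp[attr] * jaccard(st1[attr], st2[attr]) for attr in st1)

st_toyota = {"Model": Counter({"Camry": 3, "Corolla": 4}),
             "Year":  Counter({"2000": 6, "1999": 5, "2001": 2})}
st_honda  = {"Model": Counter({"Accord": 5, "Civic": 2}),
             "Year":  Counter({"2000": 4, "1999": 3})}
w_imp = {"Model": 0.5, "Year": 0.5}   # assumed attribute weights

print(round(vsim(st_toyota, st_honda, w_imp), 3))  # 0.269
```

The Model bags share nothing, so all of the similarity here comes from the overlapping Year bags.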
Empirical Evaluation

Goal
– Test the robustness of the learned dependencies
– Evaluate the effectiveness of the query relaxation and similarity estimation

Databases
– Used-car database CarDB based on Yahoo Autos: CarDB(Make, Model, Year, Price, Mileage, Location, Color)
• Populated using 100k tuples from Yahoo Autos
– Census database from the UCI Machine Learning Repository
• Populated using 45k tuples

Algorithms
– AIMQ
• RandomRelax: randomly picks the attribute to relax
• GuidedRelax: uses the relaxation order determined using approximate keys and AFDs
– ROCK: RObust Clustering using linKs (Guha et al., ICDE 1999)
• Compute neighbours and links between every pair of tuples
Neighbour: a tuple similar to the given tuple
Link: number of common neighbours between two tuples
• Cluster tuples having common neighbours
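ROCK's neighbour/link computation, used here as the baseline, can be sketched briefly; the toy similarity function and threshold below are illustrative assumptions, not ROCK's actual configuration in the paper:

```python
# Sketch of ROCK's neighbour and link computation: tuples are
# neighbours if their similarity meets a threshold theta, and the
# link count of a pair is the number of common neighbours.
def neighbours(tuples, sim, theta):
    n = len(tuples)
    return [{j for j in range(n)
             if j != i and sim(tuples[i], tuples[j]) >= theta}
            for i in range(n)]

def links(nbrs, i, j):
    return len(nbrs[i] & nbrs[j])

# Toy similarity: fraction of matching attribute values (an assumption).
sim = lambda a, b: sum(x == y for x, y in zip(a, b)) / len(a)
data = [("Toyota", "Camry"), ("Toyota", "Corolla"),
        ("Honda", "Accord"), ("Toyota", "Camry")]
nbrs = neighbours(data, sim, theta=0.5)
print(links(nbrs, 0, 1))   # common neighbours of tuples 0 and 1 -> 1
```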
Robustness of Dependencies

[Figure: (left) dependence of each dependent attribute (Model, Color, Year, Make) and (right) quality of the mined keys, each measured over samples of 15k, 25k, 50k, and 100k tuples]

Attribute dependence order & key quality are unaffected by sampling
Robustness of Value Similarities

Value          | Similar Values | 25k  | 100k
Make="Kia"     | Hyundai        | 0.17 | 0.17
               | Isuzu          | 0.15 | 0.15
               | Subaru         | 0.13 | 0.13
Make="Bronco"  | Aerostar       | 0.19 | 0.21
               | F-350          | 0    | 0.12
               | Econoline Van  | 0.11 | 0.11
Year="1985"    | 1986           | 0.16 | 0.16
               | 1984           | 0.13 | 0.14
               | 1987           | 0.12 | 0.12
Efficiency of Relaxation

[Figure: work per relevant tuple for queries 1-10 under Random Relaxation (left) and Guided Relaxation (right), for thresholds Є = 0.5, 0.6, 0.7]

Random Relaxation
• Average 8 tuples extracted per relevant tuple for Є = 0.5; increases to 120 tuples for Є = 0.7
• Not resilient to change in Є

Guided Relaxation
• Average 4 tuples extracted per relevant tuple for Є = 0.5; goes up to 12 tuples for Є = 0.7
• Resilient to change in Є
Accuracy over CarDB

• 14 queries over 100k tuples
• Similarity learned using a 25k sample
• Mean Reciprocal Rank (MRR) estimated as

MRR(Q) = Avg( 1 / (|UserRank(ti) − AIMQRank(ti)| + 1) )

• Overall high MRR shows high relevance of the suggested answers

[Figure: average MRR per query (1-14) for GuidedRelax, RandomRelax, and ROCK]
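The MRR measure used in the CarDB evaluation compares, for each answer tuple, the rank a user assigned with the rank the system assigned. A minimal sketch with hypothetical ranks:

```python
# Sketch: MRR over a set of answer tuples, given the user-assigned
# and system-assigned ranks for each tuple (assumed inputs).
def mrr(user_ranks, system_ranks):
    recip = [1 / (abs(u - s) + 1)
             for u, s in zip(user_ranks, system_ranks)]
    return sum(recip) / len(recip)

# Hypothetical ranks for 3 answer tuples: perfect agreement on the
# 1st and 3rd, off by two on the 2nd.
print(mrr([1, 2, 3], [1, 4, 3]))   # (1 + 1/3 + 1) / 3
```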
Accuracy over CensusDB

• 1000 randomly selected tuples used as queries
• Overall high MRR for AIMQ shows higher relevance of the suggested answers

[Figure: average query-tuple class similarity over the Top-10, Top-5, Top-3, and Top-1 similar answers, for AIMQ and ROCK (y-axis 0.55-0.85)]
AIMQ - Summary

An approach for answering imprecise queries over Web databases
– Mines and uses AFDs to determine the attribute relaxation order
– Domain-independent semantic similarity estimation technique
– Automatically computes attribute importance scores

Empirical evaluation shows
– Efficiency and robustness of the algorithms
– Better performance than current approaches
– High relevance of the suggested answers
– Domain independence