Answering Imprecise Queries over Autonomous Web Databases. Ullas Nambiar, Dept. of Computer Science, University of California, Davis; Subbarao Kambhampati, Dept. of Computer Science, Arizona State University. 5th April, ICDE 2006, Atlanta, USA.


Page 1:

Answering Imprecise Queries over Autonomous Web Databases

Ullas Nambiar
Dept. of Computer Science
University of California, Davis

Subbarao Kambhampati
Dept. of Computer Science
Arizona State University

5th April, ICDE 2006, Atlanta, USA

Page 2:

Dichotomy in Query Processing

Databases

• User knows what she wants

• User query completely expresses the need

• Answers exactly matching query constraints

IR Systems

• User has an idea of what she wants

• User query captures the need to some degree

• Answers ranked by degree of relevance

Page 3:

Why Support Imprecise Queries?

Want a ‘sedan’ priced around $7000

A Feasible Query: Make = “Toyota”, Model = “Camry”, Price ≤ $7000

What about the price of a Honda Accord?

Is there a Camry for $7100?

Solution: Support Imprecise Queries

Make     Model   Price   Year
Toyota   Camry   $6500   1998
Toyota   Camry   $6700   2000
Toyota   Camry   $7000   2001
Toyota   Camry   $7000   1999
………

Page 4:

Others are following …

Page 5:

What does Supporting Imprecise Queries Mean?

The Problem: Given a conjunctive query Q over a relation R, find a set of tuples that will be considered relevant by the user.

Ans(Q) = {x | x Є R, Relevance(Q, x) > c}

Objectives
– Minimal burden on the end user
– No changes to the existing database
– Domain independence

Motivation
– How far can we go with a relevance model estimated from the database itself?
  • Tuples represent real-world objects and relationships between them
– Use the estimated relevance model to provide a ranked set of tuples similar to the query

Page 6:

Challenges

Estimating Query-Tuple Similarity
– Weighted summation of attribute similarities
– Need to estimate semantic similarity

Measuring Attribute Importance
– Not all attributes equally important
– Users cannot quantify importance

Page 7:

Our Solution: AIMQ

Imprecise Query Q
→ Map: convert “like” to “=” (Qpr = Map(Q))
→ Query Engine: derive the Base Set Abs, where Abs = Qpr(R)
→ Dependency Miner: use the Base Set as a set of relaxable selection queries; using AFDs, find the relaxation order; derive the Extended Set by executing the relaxed queries
→ Similarity Miner: use value similarities and attribute importance to measure tuple similarities; prune tuples below the threshold
→ Return Ranked Set
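To make the data flow concrete, here is a minimal, self-contained Python sketch of this pipeline on a toy CarDB relation. The relation, the fixed relaxation order, and the simplified similarity function are illustrative assumptions only; the actual system learns the relaxation order from AFDs and the value similarities from the data (see the later slides).

```python
# Minimal sketch of the AIMQ flow on a toy CarDB relation (illustrative only).
CARDB = [
    {"Make": "Toyota", "Model": "Camry",   "Price": 10000, "Year": 2000},
    {"Make": "Toyota", "Model": "Camry",   "Price": 10000, "Year": 2001},
    {"Make": "Toyota", "Model": "Corolla", "Price": 9500,  "Year": 2001},
    {"Make": "Honda",  "Model": "Accord",  "Price": 10500, "Year": 2000},
]

def map_to_precise(imprecise_query):
    # Map step: every "like" binding is treated as an equality binding.
    return dict(imprecise_query)

def select(relation, query):
    # Query Engine: answer a precise selection query, e.g. Abs = Qpr(R).
    return [t for t in relation if all(t[a] == v for a, v in query.items())]

def relaxations(selection_query, relax_order):
    # Dependency Miner (simplified): drop bindings one at a time,
    # least important attribute first, yielding progressively relaxed queries.
    q = dict(selection_query)
    for attr in relax_order:
        if attr in q:
            del q[attr]
            yield dict(q)

def similarity(query, tup):
    # Similarity Miner (toy stand-in): fraction of query bindings the tuple matches.
    return sum(tup[a] == v for a, v in query.items()) / len(query)

q = {"Model": "Camry", "Price": 10000}          # Q: Model like "Camry", Price like 10k
q_pr = map_to_precise(q)                        # Qpr
base_set = select(CARDB, q_pr)                  # Abs
extended_set = list(base_set)
for t in base_set:                              # each base-set tuple is a relaxable query
    for rq in relaxations(t, ["Price", "Model", "Year", "Make"]):
        extended_set += [x for x in select(CARDB, rq) if x not in extended_set]
ranked = sorted(extended_set, key=lambda t: similarity(q_pr, t), reverse=True)
print(ranked)
```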

Page 8:

An Illustrative Example


Relation: CarDB(Make, Model, Price, Year)

Imprecise query
Q :− CarDB(Model like “Camry”, Price like “10k”)

Base query
Qpr :− CarDB(Model = “Camry”, Price = “10k”)

Base set Abs
Make = “Toyota”, Model = “Camry”, Price = “10k”, Year = “2000”
Make = “Toyota”, Model = “Camry”, Price = “10k”, Year = “2001”

Page 9:

Obtaining Extended Set


Problem: Given base set, find tuples from database similar to tuples in base set.

Solution:
– Consider each tuple in the base set as a selection query,
  e.g. Make = “Toyota”, Model = “Camry”, Price = “10k”, Year = “2000”
– Relax each such query to obtain “similar” precise queries,
  e.g. Make = “Toyota”, Model = “Camry”, Price = “”, Year = “2000”
– Execute and determine tuples having similarity above some threshold.

Challenge: Which attribute should be relaxed first?
– Make? Model? Price? Year?

Solution: Relax the least important attribute first.
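As a small illustration of this step, the sketch below turns one base-set tuple into a sequence of relaxed selection queries by dropping the least important attribute first. The attribute order and the SQL-style rendering are assumptions for illustration; AIMQ mines the order from AFDs and approximate keys.

```python
# Sketch: derive relaxed selection queries from one base-set tuple.
# The relaxation order Price < Model < Year is assumed here, not mined.
def relaxed_queries(base_tuple, relax_order):
    bound = dict(base_tuple)
    for attr in relax_order:
        bound.pop(attr, None)                      # relax (unbind) this attribute
        if bound:
            where = " AND ".join(f"{a} = '{v}'" for a, v in bound.items())
            yield f"SELECT * FROM CarDB WHERE {where}"

t = {"Make": "Toyota", "Model": "Camry", "Price": "10k", "Year": "2000"}
for q in relaxed_queries(t, ["Price", "Model", "Year"]):
    print(q)
# SELECT * FROM CarDB WHERE Make = 'Toyota' AND Model = 'Camry' AND Year = '2000'
# SELECT * FROM CarDB WHERE Make = 'Toyota' AND Year = '2000'
# SELECT * FROM CarDB WHERE Make = 'Toyota'
```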

Page 10:

Least Important Attribute

Definition: An attribute whose binding value, when changed, has minimal effect on the values binding other attributes.
– Does not decide the values of other attributes
– Its value may depend on other attributes
E.g. changing/relaxing Price will usually not affect other attributes, but changing Model usually affects Price.

Requires dependence between attributes to decide relative importance
– Attribute dependence information is not provided by the sources
– Learn it using Approximate Functional Dependencies (AFDs) and Approximate Keys

• Approximate Functional Dependency (AFD)
  X → A is an FD over r', where r' ⊆ r. If error(X → A) = |r − r'| / |r| < 1, then X → A is an AFD over r.
• Approximate in the sense that the dependency is obeyed by a large percentage (but not all) of the tuples in the database.


TANE, an algorithm by Huhtala et al. [1999], is used to mine the AFDs and approximate keys.

• Exponential in the number of attributes

• Linear in the number of tuples
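The error measure in the definition above (the fraction of tuples that must be removed for the dependency to hold exactly) can be computed directly; a minimal sketch is shown below. TANE itself searches the lattice of attribute sets far more efficiently, so this is only an illustration of the measure, not of the mining algorithm.

```python
from collections import Counter, defaultdict

def afd_error(relation, lhs, rhs):
    # error(X -> A): fraction of tuples that must be removed so that X -> A holds
    # exactly (tuples disagreeing with the majority A-value of their X-group).
    groups = defaultdict(Counter)
    for t in relation:
        groups[tuple(t[a] for a in lhs)][t[rhs]] += 1
    removed = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return removed / len(relation)

cars = [
    {"Make": "Toyota", "Model": "Camry",   "Year": 2000},
    {"Make": "Toyota", "Model": "Camry",   "Year": 2001},
    {"Make": "Honda",  "Model": "Accord",  "Year": 2000},
    {"Make": "Toyota", "Model": "Corolla", "Year": 2000},
]
print(afd_error(cars, ["Model"], "Make"))   # 0.0: Model -> Make holds on every tuple
print(afd_error(cars, ["Year"],  "Make"))   # 0.25: Year -> Make violated by one tuple
```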

Page 11:

Deciding Attribute Importance

• Mine AFDs and Approximate Keys
• Create a dependence graph using the AFDs
– Strongly connected, hence a topological sort is not possible
• Using the Approximate Key with highest support, partition the attributes into
– Deciding set
– Dependent set
– Sort the subsets using dependence and influence weights
• Measure attribute importance using the weights below


CarDB(Make, Model, Year, Price)

Decides: Make, Year
Depends: Model, Price

Order: Price, Model, Year, Make

1-attribute: {Price, Model, Year, Make}

2-attribute: {(Price, Model), (Price, Year), (Price, Make), …}

• Attribute relaxation order is all non-keys first, then keys
• Greedy multi-attribute relaxation

Sorting weights (within the deciding / dependent subsets):
  Wtdecides(Ai) / AWtdecides   or   Wtdepends(Ai) / AWtdepends

Attribute importance:
  Wimp(Ai) = RelaxOrder(Ai) / count(Attributes(R))
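A small sketch of how these quantities could be computed, assuming that RelaxOrder(Ai) is the 1-based position of Ai in the mined relaxation order and that relaxed queries are enumerated greedily, 1-attribute subsets before 2-attribute subsets, non-keys before keys. Both assumptions are for illustration only.

```python
from itertools import combinations

def attribute_importance(relax_order):
    # Wimp(Ai) = RelaxOrder(Ai) / count(Attributes(R)),
    # taking RelaxOrder(Ai) as the 1-based position of Ai (least important = 1).
    n = len(relax_order)
    return {attr: (pos + 1) / n for pos, attr in enumerate(relax_order)}

def relaxation_subsets(non_keys, keys, max_size=2):
    # Greedy multi-attribute relaxation (assumed enumeration): non-keys before keys,
    # 1-attribute subsets before 2-attribute subsets, and so on.
    ordered = list(non_keys) + list(keys)
    for size in range(1, max_size + 1):
        for subset in combinations(ordered, size):
            yield subset

print(attribute_importance(["Price", "Model", "Year", "Make"]))
# {'Price': 0.25, 'Model': 0.5, 'Year': 0.75, 'Make': 1.0}
print(list(relaxation_subsets(["Price", "Model"], ["Year", "Make"])))
# [('Price',), ('Model',), ('Year',), ('Make',),
#  ('Price', 'Model'), ('Price', 'Year'), ('Price', 'Make'), ..., ('Year', 'Make')]
```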

Page 12:

Query-Tuple Similarity

• Tuples in the extended set show different levels of relevance
• Ranked according to their similarity to the corresponding tuples in the base set, using the formula below, where
– n = Count(Attributes(R)) and Wimp is the importance weight of the attribute
– distance-based similarity is used for numerical attributes, e.g. Price, Year
– VSim is the semantic value similarity estimated by AIMQ for categorical attributes, e.g. Make, Model

Sim(Q, t) = Σ i=1..n  Wimp(Ai) ×
    VSim(Q.Ai, t.Ai)               if Dom(Ai) = Categorical
    1 − |Q.Ai − t.Ai| / Q.Ai       if Dom(Ai) = Numerical
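The sketch below computes this weighted similarity directly from the formula. The stand-in VSim function and the example importance weights are assumptions; AIMQ learns both from the data.

```python
def query_tuple_similarity(query, tup, w_imp, vsim, numerical):
    # Sim(Q, t): importance-weighted sum of per-attribute similarities,
    # VSim for categorical attributes, 1 - |Q.Ai - t.Ai| / Q.Ai for numerical ones.
    total = 0.0
    for attr, q_val in query.items():
        if attr in numerical:
            attr_sim = 1.0 - abs(q_val - tup[attr]) / q_val
        else:
            attr_sim = vsim(attr, q_val, tup[attr])
        total += w_imp[attr] * attr_sim
    return total

w_imp = {"Model": 0.5, "Price": 0.25}                   # illustrative importance weights
vsim = lambda attr, v1, v2: 1.0 if v1 == v2 else 0.3    # stand-in for the learned VSim
q = {"Model": "Camry", "Price": 10000}
t = {"Make": "Toyota", "Model": "Camry", "Price": 9500, "Year": 2001}
print(query_tuple_similarity(q, t, w_imp, vsim, numerical={"Price", "Year"}))
# 0.5 * 1.0 + 0.25 * (1 - 500/10000) = 0.7375
```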


Page 13:

Categorical Value Similarity

• Two words are semantically similar if they have a common context (an idea from NLP)
• The context of a value is represented as a set of bags of co-occurring values, called a Supertuple
• Value Similarity: estimated as the percentage of common {Attribute, Value} pairs
– Measured as the Jaccard similarity among the supertuples representing the values

Supertuple for concept Make = Toyota, ST(QMake=Toyota):
  Model: Camry: 3, Corolla: 4, …
  Year:  2000: 6, 1999: 5, 2001: 2, …
  Price: 5995: 4, 6500: 3, 4000: 6

JaccardSim(A, B) = |A ∩ B| / |A ∪ B|

VSim(v1, v2) = Σ i=1..m  Wimp(Ai) × JaccardSim(ST(v1).Ai, ST(v2).Ai)
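Below is a minimal sketch of building supertuples and estimating VSim with a set-based Jaccard measure. The toy relation and the use of plain sets (rather than frequency-weighted bags) are simplifications for illustration.

```python
from collections import Counter, defaultdict

def supertuple(relation, attr, value):
    # Context of a categorical value: for every other attribute,
    # a bag of values co-occurring with attr = value (e.g. ST(Make=Toyota)).
    st = defaultdict(Counter)
    for t in relation:
        if t[attr] == value:
            for other, v in t.items():
                if other != attr:
                    st[other][v] += 1
    return st

def jaccard(bag_a, bag_b):
    # Simplified set-based Jaccard over the co-occurring values: |A ∩ B| / |A ∪ B|.
    a, b = set(bag_a), set(bag_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def vsim(relation, attr, v1, v2, w_imp):
    # VSim(v1, v2) = sum over other attributes of Wimp(Ai) * JaccardSim(ST(v1).Ai, ST(v2).Ai).
    st1, st2 = supertuple(relation, attr, v1), supertuple(relation, attr, v2)
    return sum(w * jaccard(st1[a], st2[a]) for a, w in w_imp.items())

cars = [
    {"Make": "Toyota", "Model": "Camry",   "Year": 2000},
    {"Make": "Toyota", "Model": "Corolla", "Year": 2001},
    {"Make": "Honda",  "Model": "Accord",  "Year": 2000},
    {"Make": "Honda",  "Model": "Civic",   "Year": 2001},
]
print(vsim(cars, "Make", "Toyota", "Honda", {"Model": 0.5, "Year": 0.5}))
# Models never overlap, years fully overlap: 0.5 * 0 + 0.5 * 1 = 0.5
```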


Page 14:

Empirical Evaluation

• Goal
– Test the robustness of the learned dependencies
– Evaluate the effectiveness of the query relaxation and similarity estimation

• Databases
– Used-car database CarDB based on Yahoo Autos
  CarDB(Make, Model, Year, Price, Mileage, Location, Color)
  • Populated using 100k tuples from Yahoo Autos
– Census database from the UCI Machine Learning Repository
  • Populated using 45k tuples

• Algorithms
– AIMQ
  • RandomRelax: randomly picks the attribute to relax
  • GuidedRelax: uses the relaxation order determined using approximate keys and AFDs
– ROCK: RObust Clustering using linKs (Guha et al., ICDE 1999)
  • Compute neighbours and links between every pair of tuples
    Neighbour: tuples similar to each other
    Link: number of common neighbours between two tuples
  • Cluster tuples having common neighbours (a minimal link-counting sketch follows)
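As a rough illustration of the baseline's core step, here is a sketch of the neighbour and link computation. The similarity function, the threshold, and the omission of ROCK's actual goodness-based merging are all simplifications, not the authors' or ROCK's code.

```python
def neighbours_and_links(tuples, similar, theta):
    # ROCK-style preprocessing: tuples i and j are neighbours if similar(i, j) >= theta;
    # link(i, j) = number of neighbours the two tuples have in common.
    n = len(tuples)
    nbrs = [{j for j in range(n) if j != i and similar(tuples[i], tuples[j]) >= theta}
            for i in range(n)]
    links = {(i, j): len(nbrs[i] & nbrs[j]) for i in range(n) for j in range(i + 1, n)}
    return nbrs, links

# Toy similarity: fraction of attributes on which two tuples agree.
agree = lambda a, b: sum(a[k] == b[k] for k in a) / len(a)
cars = [
    {"Make": "Toyota", "Model": "Camry",   "Year": 2000},
    {"Make": "Toyota", "Model": "Camry",   "Year": 2001},
    {"Make": "Toyota", "Model": "Corolla", "Year": 2001},
    {"Make": "Honda",  "Model": "Accord",  "Year": 2000},
]
nbrs, links = neighbours_and_links(cars, agree, theta=0.5)
print(links)   # tuple pairs sharing many neighbours would be clustered together
```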

Page 15:

Robustness of Dependencies

[Two charts: (1) dependence of each dependent attribute (Model, Color, Year, Make) and (2) quality of the mined keys, each measured over samples of 15k, 25k, 50k, and 100k tuples.]

Attribute dependence order & key quality are unaffected by sampling.

Page 16:

Robustness of Value Similarities

Value             Similar Values    25k     100k
Make = “Kia”      Hyundai           0.17    0.17
                  Isuzu             0.15    0.15
                  Subaru            0.13    0.13
Make = “Bronco”   Aerostar          0.19    0.21
                  F-350             0       0.12
                  Econoline Van     0.11    0.11
Year = “1985”     1986              0.16    0.16
                  1984              0.13    0.14
                  1987              0.12    0.12

Page 17:

Efficiency of Relaxation

[Two charts: work per relevant tuple for 10 queries at thresholds Є = 0.5, 0.6, and 0.7, for Random Relaxation (left) and Guided Relaxation (right).]

Random Relaxation
• Average 8 tuples extracted per relevant tuple for Є = 0.5; increases to 120 tuples for Є = 0.7.
• Not resilient to change in Є.

Guided Relaxation
• Average 4 tuples extracted per relevant tuple for Є = 0.5; goes up to 12 tuples for Є = 0.7.
• Resilient to change in Є.

Page 18:

Accuracy over CarDB

• 14 queries over 100k tuples
• Similarity learned using a 25k sample
• Mean Reciprocal Rank (MRR) estimated as (sketched below)
  MRR(Q) = Avg( 1 / (|UserRank(ti) − AIMQRank(ti)| + 1) )
• Overall high MRR shows high relevance of the suggested answers
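A one-function sketch of this MRR measure as written above; the tuple identifiers and ranks are made-up example values.

```python
def mrr(user_rank, aimq_rank):
    # MRR(Q) = average over answers of 1 / (|UserRank(ti) - AIMQRank(ti)| + 1).
    return sum(1.0 / (abs(user_rank[t] - aimq_rank[t]) + 1) for t in user_rank) / len(user_rank)

user = {"t1": 1, "t2": 2, "t3": 3}          # ranks a user assigned to three answers
aimq = {"t1": 1, "t2": 3, "t3": 2}          # ranks AIMQ assigned to the same answers
print(round(mrr(user, aimq), 3))            # (1 + 1/2 + 1/2) / 3 = 0.667
```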

[Chart: average MRR for each of the 14 queries, comparing GuidedRelax, RandomRelax, and ROCK.]

Page 19:

Accuracy over CensusDB

• 1000 randomly selected tuples as queries

• Overall high MRR for AIMQ shows higher relevance of the suggested answers

[Chart: average query-tuple class similarity of the Top-10, Top-5, Top-3, and Top-1 answers (y-axis 0.55 to 0.85), comparing AIMQ and ROCK.]

Page 20:

AIMQ - Summary

An approach for answering imprecise queries over Web databases
– Mine and use AFDs to determine the attribute order
– Domain-independent semantic similarity estimation technique
– Automatically compute attribute importance scores

Empirical evaluation shows
– Efficiency and robustness of the algorithms
– Better performance than current approaches
– High relevance of the suggested answers
– Domain independence
