Answering Imprecise Queries over Autonomous Web Databases

Ullas Nambiar, Dept. of Computer Science, University of California, Davis
Subbarao Kambhampati, Dept. of Computer Science, Arizona State University

5th April, ICDE 2006, Atlanta, USA
Dichotomy in Query Processing
Databases
• User knows what she wants
• User query completely expresses the need
• Answers exactly matching query constraints
IR Systems
• User has an idea of what she wants
• User query captures the need to some degree
• Answers ranked by degree of relevance
Why Support Imprecise Queries?

Want a 'sedan' priced around $7000

A feasible query:
Make = "Toyota", Model = "Camry", Price ≤ $7000

What about the price of a Honda Accord?
Is there a Camry for $7100?

Solution: Support Imprecise Queries

Make   | Model | Price | Year
Toyota | Camry | $6500 | 1998
Toyota | Camry | $6700 | 2000
Toyota | Camry | $7000 | 2001
Toyota | Camry | $7000 | 1999
………
Others are following …
What does Supporting Imprecise Queries Mean?

The Problem: Given a conjunctive query Q over a relation R, find a set of tuples that will be considered relevant by the user.

Ans(Q) = {x | x ∈ R, Relevance(Q, x) > c}

Objectives
– Minimal burden on the end user
– No changes to the existing database
– Domain independence

Motivation
– How far can we go with a relevance model estimated from the database alone?
• Tuples represent real-world objects and the relationships between them
– Use the estimated relevance model to provide a ranked set of tuples similar to the query
Challenges

Estimating query-tuple similarity
– Weighted summation of attribute similarities
– Need to estimate semantic similarity

Measuring attribute importance
– Not all attributes are equally important
– Users cannot quantify importance
Our Solution: AIMQ

The AIMQ pipeline:
Imprecise query Q
→ Map: convert "like" to "=", giving Qpr = Map(Q)
→ Query Engine: derive base set Abs = Qpr(R)
→ Dependency Miner: use the base set as a set of relaxable selection queries; using AFDs, find the relaxation order; derive the extended set by executing the relaxed queries
→ Similarity Miner: use value similarities and attribute importance to measure tuple similarities; prune tuples below a threshold
→ Return the ranked set
An Illustrative Example
Relation: CarDB(Make, Model, Price, Year)

Imprecise query
Q :− CarDB(Model like "Camry", Price like "10k")

Base query
Qpr :− CarDB(Model = "Camry", Price = "10k")

Base set Abs
Make = "Toyota", Model = "Camry", Price = "10k", Year = "2000"
Make = "Toyota", Model = "Camry", Price = "10k", Year = "2001"
Obtaining Extended Set
Problem: Given base set, find tuples from database similar to tuples in base set.
Solution:
– Consider each tuple in the base set as a selection query.
e.g. Make = "Toyota", Model = "Camry", Price = "10k", Year = "2000"
– Relax each such query to obtain "similar" precise queries.
e.g. Make = "Toyota", Model = "Camry", Price = "", Year = "2000"
– Execute and determine tuples having similarity above some threshold.
Challenge: Which attribute should be relaxed first?
– Make ? Model ? Price ? Year ?
Solution: Relax least important attribute first.
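A minimal sketch of this step: each base-set tuple is treated as a selection query, and bindings are dropped one at a time, least important attribute first. The helper name and the hard-coded relaxation order below are illustrative; AIMQ derives the order from mined AFDs.

```python
# Sketch: relax a base-set tuple-query by dropping attribute bindings,
# least important attribute first. The order is an assumed input here.
def relaxed_queries(tuple_query, relax_order):
    """Yield progressively relaxed selection queries."""
    query = dict(tuple_query)
    for attr in relax_order:
        query.pop(attr)          # relax: remove this attribute's binding
        if query:
            yield dict(query)

base_tuple = {"Make": "Toyota", "Model": "Camry",
              "Price": "10k", "Year": "2000"}
order = ["Price", "Model", "Year", "Make"]   # assumed: least important first

for q in relaxed_queries(base_tuple, order):
    print(q)
```

Each yielded query is then executed against the database, and the returned tuples are kept only if their similarity to the base tuple exceeds the threshold.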
Least Important Attribute

Definition: An attribute whose binding value, when changed, has minimal effect on the values binding the other attributes.
– Does not decide the values of other attributes
– Its value may depend on other attributes
E.g. changing/relaxing Price will usually not affect other attributes, but changing Model usually affects Price.

Deciding relative importance requires knowing the dependence between attributes
– Attribute dependence information is not provided by the sources
– Learn it using Approximate Functional Dependencies & Approximate Keys

• Approximate Functional Dependency (AFD)
X → A is an FD over r′, r′ ⊆ r.
If error(X → A) = |r − r′| / |r| < 1, then X → A is an AFD over r.
• Approximate in the sense that the dependency is obeyed by a large percentage (but not all) of the tuples in the database
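The AFD error above can be computed directly: within each group of tuples sharing the same X values, every tuple that disagrees with the majority A value must be removed for X → A to hold exactly. A minimal sketch (the toy relation and function name are illustrative):

```python
from collections import Counter, defaultdict

# Sketch: AFD error of X -> A over a relation given as a list of dicts.
# error = minimum fraction of tuples to delete so X -> A holds exactly.
def afd_error(rows, X, A):
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[x] for x in X)
        groups[key][row[A]] += 1
    # Keep the majority A value in each X-group; the rest violate the FD.
    kept = sum(counts.most_common(1)[0][1] for counts in groups.values())
    return 1 - kept / len(rows)

cars = [
    {"Make": "Toyota", "Model": "Camry"},
    {"Make": "Toyota", "Model": "Camry"},
    {"Make": "Toyota", "Model": "Corolla"},
    {"Make": "Honda",  "Model": "Accord"},
]
# Model -> Make holds exactly; Make -> Model fails for 1 of 4 tuples.
print(afd_error(cars, ["Model"], "Make"))   # 0.0
print(afd_error(cars, ["Make"], "Model"))   # 0.25
```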
TANE, an algorithm by Huhtala et al. [1999], is used to mine the AFDs and approximate keys
• Exponential in the number of attributes
• Linear in the number of tuples
Deciding Attribute Importance

Mine AFDs and approximate keys
Create a dependence graph using the AFDs
– Strongly connected, hence a topological sort is not possible
Using the approximate key with highest support, partition the attributes into
– Deciding set
– Dependent set
Sort the subsets using dependence and influence weights
Measure attribute importance as:
Example: CarDB(Make, Model, Year, Price)
Decides: Make, Year
Depends: Model, Price
Order: Price, Model, Year, Make
1-attribute relaxations: {Price, Model, Year, Make}
2-attribute relaxations: {(Price, Model), (Price, Year), (Price, Make), …}
• Attribute relaxation order is all non-keys first, then keys
• Greedy multi-attribute relaxation
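The greedy enumeration of relaxation subsets sketched above (increasing subset size, less important attributes first within each size) can be written compactly; the attribute order is an assumed input, in practice derived from the mined approximate keys and AFDs:

```python
from itertools import combinations

# Sketch: enumerate attribute subsets to relax, in greedy order:
# smaller subsets first, and within a size, combinations of the
# least important attributes first.
def relaxation_subsets(order, max_size=2):
    """order lists attributes least important first."""
    for size in range(1, max_size + 1):
        for combo in combinations(order, size):
            yield combo

order = ["Price", "Model", "Year", "Make"]   # non-keys first, then keys
print(list(relaxation_subsets(order)))
```

Because `itertools.combinations` preserves input order, this reproduces the slide's sequence: Price, Model, Year, Make, then (Price, Model), (Price, Year), (Price, Make), and so on.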
Wimp(Ai) = RelaxOrder(Ai) / count(Attributes(R)) × Wtdecides(Ai) (for deciding attributes) or Wtdepends(Ai) (for dependent attributes)
Query-Tuple Similarity

Tuples in the extended set show different levels of relevance. They are ranked according to their similarity to the corresponding tuples in the base set using

Sim(Q, t) = Σ(i = 1..n) Wimp(Ai) ×
  VSim(Q.Ai, t.Ai)              if Dom(Ai) is categorical
  1 − |Q.Ai − t.Ai| / Q.Ai      if Dom(Ai) is numerical

– n = count(Attributes(R)) and Wimp is the importance weight of the attribute
– Euclidean distance as similarity for numerical attributes, e.g. Price, Year
– VSim: semantic value similarity estimated by AIMQ for categorical attributes, e.g. Make, Model
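A minimal sketch of this weighted query-tuple similarity; the importance weights and the VSim stand-in below are hypothetical placeholders, not the values AIMQ learns:

```python
# Sketch: weighted query-tuple similarity over categorical and
# numerical attributes. w_imp and vsim are assumed inputs.
def tuple_similarity(query, t, w_imp, vsim, numerical):
    sim = 0.0
    for attr, q_val in query.items():
        if attr in numerical:
            # distance-based similarity for numeric attributes
            sim += w_imp[attr] * (1 - abs(q_val - t[attr]) / q_val)
        else:
            sim += w_imp[attr] * vsim(q_val, t[attr])
    return sim

w_imp = {"Model": 0.4, "Price": 0.6}          # assumed weights
vsim = lambda a, b: 1.0 if a == b else 0.3    # stand-in for learned VSim
q = {"Model": "Camry", "Price": 10000}
t = {"Model": "Corolla", "Price": 9000}
print(tuple_similarity(q, t, w_imp, vsim, numerical={"Price"}))  # 0.66
```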
Categorical Value Similarity

Two words are semantically similar if they have a common context (an idea from NLP).
The context of a value is represented as a set of bags of co-occurring values, called a supertuple.
Value similarity: estimated as the percentage of common {Attribute, Value} pairs
– Measured as the Jaccard similarity among the supertuples representing the values

Supertuple for concept Make = "Toyota", ST(QMake=Toyota):
Model: Camry: 3, Corolla: 4, …
Year:  2000: 6, 1999: 5, 2001: 2, …
Price: 5995: 4, 6500: 3, 4000: 6

JaccardSim(A, B) = |A ∩ B| / |A ∪ B|

VSim(v1, v2) = Σ(i = 1..m) Wimp(Ai) × JaccardSim(ST(v1).Ai, ST(v2).Ai)
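Since supertuples are bags, the Jaccard measure can be taken over multisets, which `collections.Counter` supports directly via `&` and `|`. A minimal sketch, with made-up supertuples and assumed weights:

```python
from collections import Counter

# Sketch: attribute-weighted Jaccard similarity between the bag-valued
# supertuples of two categorical values. Supertuples and weights are
# assumed inputs; AIMQ builds supertuples from co-occurring values.
def jaccard(a: Counter, b: Counter) -> float:
    inter = sum((a & b).values())   # multiset intersection size
    union = sum((a | b).values())   # multiset union size
    return inter / union if union else 0.0

def vsim(st1, st2, w_imp):
    return sum(w_imp[attr] * jaccard(st1[attr], st2[attr]) for attr in st1)

st_toyota = {"Model": Counter({"Camry": 3, "Corolla": 4}),
             "Year":  Counter({"2000": 6, "1999": 5, "2001": 2})}
st_honda  = {"Model": Counter({"Accord": 5, "Civic": 2}),
             "Year":  Counter({"2000": 4, "1999": 3})}
w_imp = {"Model": 0.5, "Year": 0.5}   # assumed attribute weights

print(round(vsim(st_toyota, st_honda, w_imp), 3))  # 0.269
```

The Model bags share nothing, so all of the similarity here comes from the overlapping Year bags.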
Empirical Evaluation

Goal
– Test the robustness of the learned dependencies
– Evaluate the effectiveness of the query relaxation and similarity estimation

Databases
– Used-car database CarDB based on Yahoo Autos: CarDB(Make, Model, Year, Price, Mileage, Location, Color)
• Populated using 100k tuples from Yahoo Autos
– Census database from the UCI Machine Learning Repository
• Populated using 45k tuples

Algorithms
– AIMQ
• RandomRelax: randomly picks the attribute to relax
• GuidedRelax: uses the relaxation order determined using approximate keys and AFDs
– ROCK: RObust Clustering using linKs (Guha et al., ICDE 1999)
• Compute neighbours and links between every pair of tuples
Neighbour: a tuple similar to the given tuple
Link: number of common neighbours between two tuples
• Cluster tuples having common neighbours
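ROCK's neighbour/link computation, used here as the baseline, can be sketched briefly; the toy similarity function and threshold below are illustrative assumptions, not ROCK's actual configuration in the paper:

```python
# Sketch of ROCK's neighbour and link computation: tuples are
# neighbours if their similarity meets a threshold theta, and the
# link count of a pair is the number of common neighbours.
def neighbours(tuples, sim, theta):
    n = len(tuples)
    return [{j for j in range(n)
             if j != i and sim(tuples[i], tuples[j]) >= theta}
            for i in range(n)]

def links(nbrs, i, j):
    return len(nbrs[i] & nbrs[j])

# Toy similarity: fraction of matching attribute values (an assumption).
sim = lambda a, b: sum(x == y for x, y in zip(a, b)) / len(a)
data = [("Toyota", "Camry"), ("Toyota", "Corolla"),
        ("Honda", "Accord"), ("Toyota", "Camry")]
nbrs = neighbours(data, sim, theta=0.5)
print(links(nbrs, 0, 1))   # common neighbours of tuples 0 and 1 -> 1
```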
Robustness of Dependencies

[Figure: (left) dependence of each dependent attribute (Model, Color, Year, Make) and (right) quality of the mined keys, each measured over samples of 15k, 25k, 50k, and 100k tuples]

Attribute dependence order & key quality are unaffected by sampling
Robustness of Value Similarities

Value          | Similar Values | 25k  | 100k
Make="Kia"     | Hyundai        | 0.17 | 0.17
               | Isuzu          | 0.15 | 0.15
               | Subaru         | 0.13 | 0.13
Make="Bronco"  | Aerostar       | 0.19 | 0.21
               | F-350          | 0    | 0.12
               | Econoline Van  | 0.11 | 0.11
Year="1985"    | 1986           | 0.16 | 0.16
               | 1984           | 0.13 | 0.14
               | 1987           | 0.12 | 0.12
Efficiency of Relaxation

[Figure: work per relevant tuple for queries 1-10 under Random Relaxation (left) and Guided Relaxation (right), for thresholds Є = 0.5, 0.6, 0.7]

Random Relaxation
• Average 8 tuples extracted per relevant tuple for Є = 0.5; increases to 120 tuples for Є = 0.7
• Not resilient to change in Є

Guided Relaxation
• Average 4 tuples extracted per relevant tuple for Є = 0.5; goes up to 12 tuples for Є = 0.7
• Resilient to change in Є
Accuracy over CarDB

• 14 queries over 100k tuples
• Similarity learned using a 25k sample
• Mean Reciprocal Rank (MRR) estimated as

MRR(Q) = Avg( 1 / (|UserRank(ti) − AIMQRank(ti)| + 1) )

• Overall high MRR shows high relevance of the suggested answers

[Figure: average MRR per query (1-14) for GuidedRelax, RandomRelax, and ROCK]
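The MRR measure used in the CarDB evaluation compares, for each answer tuple, the rank a user assigned with the rank the system assigned. A minimal sketch with hypothetical ranks:

```python
# Sketch: MRR over a set of answer tuples, given the user-assigned
# and system-assigned ranks for each tuple (assumed inputs).
def mrr(user_ranks, system_ranks):
    recip = [1 / (abs(u - s) + 1)
             for u, s in zip(user_ranks, system_ranks)]
    return sum(recip) / len(recip)

# Hypothetical ranks for 3 answer tuples: perfect agreement on the
# 1st and 3rd, off by two on the 2nd.
print(mrr([1, 2, 3], [1, 4, 3]))   # (1 + 1/3 + 1) / 3
```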
Accuracy over CensusDB

• 1000 randomly selected tuples used as queries
• Overall high MRR for AIMQ shows higher relevance of the suggested answers

[Figure: average query-tuple class similarity over the Top-10, Top-5, Top-3, and Top-1 similar answers, for AIMQ and ROCK (y-axis 0.55-0.85)]
AIMQ - Summary

An approach for answering imprecise queries over Web databases
– Mines and uses AFDs to determine the attribute relaxation order
– Domain-independent semantic similarity estimation technique
– Automatically computes attribute importance scores

Empirical evaluation shows
– Efficiency and robustness of the algorithms
– Better performance than current approaches
– High relevance of the suggested answers
– Domain independence