Page 1: Query Processing over Incomplete Autonomous Databases

Query Processing over Incomplete Autonomous Databases

Presented By Garrett Wolf, Hemal Khatri, Bhaumik Chokshi, Jianchun Fan, Yi Chen, Subbarao Kambhampati

Arizona State University

2008-02-04

Summarized by Sungchan Park

Page 2: Query Processing over Incomplete Autonomous Databases

Copyright 2008 by CEBT

Introduction

More and more data is becoming accessible via web servers which are supported by backend autonomous databases

E.g., Cars.com, Realtor.com, Google Base, etc.

Center for E-Business Technology

Figure: a mediator in front of three autonomous databases

Page 3: Query Processing over Incomplete Autonomous Databases

Web DBs are Incomplete!

Incomplete Entry

Inaccurate Extraction

Heterogeneous Schemas

User-Defined Schemas

Page 4: Query Processing over Incomplete Autonomous Databases

Problem

Current autonomous database systems only return certain answers, namely those which exactly satisfy all the user query constraints

Although there has been work on handling incompleteness in databases, much of it has been focused on single databases on which the query processor has complete control.

Such approaches modify the database directly by replacing null values with likely values.

– Not applicable to autonomous databases

Page 5: Query Processing over Incomplete Autonomous Databases

Possible Naïve Approaches

Query Q: (Body Style = Convt)

CERTAINONLY

Return only certain answers

– Low Recall

ALLRETURNED

Return all answers having Body Style = Convt or Body Style = Null

– Low Precision, Infeasible

ALLRANKED

Return all answers having Body Style = Convt. Additionally, rank all answers having body style as null by predicting the missing values and return them to the user

– Costly, Infeasible
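The three naïve approaches can be sketched over a small in-memory table. This is only an illustration: the rows, the function names, and the `toy_relevance` scorer are made up here, and a real autonomous source would usually not allow the null-valued lookups that ALLRETURNED and ALLRANKED depend on.

```python
# Toy relation; None stands for a null value. All rows are invented for illustration.
CARS = [
    {"make": "BMW", "model": "Z4", "body": "Convt"},
    {"make": "Honda", "model": "Civic", "body": None},
    {"make": "BMW", "model": "Z4", "body": None},
    {"make": "Honda", "model": "Civic", "body": "Sedan"},
]


def certain_only(rows, attr, value):
    """CERTAINONLY: keep only tuples that exactly satisfy the constraint."""
    return [r for r in rows if r[attr] == value]


def all_returned(rows, attr, value):
    """ALLRETURNED: certain answers plus every tuple whose attr is null."""
    return [r for r in rows if r[attr] == value or r[attr] is None]


def all_ranked(rows, attr, value, relevance):
    """ALLRANKED: certain answers first, then null-valued tuples by predicted relevance."""
    certain = certain_only(rows, attr, value)
    uncertain = [r for r in rows if r[attr] is None]
    uncertain.sort(key=relevance, reverse=True)
    return certain + uncertain


def toy_relevance(row):
    """Hypothetical stand-in for a learned predictor of P(body = Convt | row)."""
    return 0.9 if row["model"] == "Z4" else 0.1


ranked = all_ranked(CARS, "body", "Convt", toy_relevance)
```

CERTAINONLY misses the Z4 with the null body style (low recall), ALLRETURNED also returns the Civic with the null body style (low precision), and ALLRANKED has to fetch and score every null-valued tuple before it can order them (costly).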

Page 6: Query Processing over Incomplete Autonomous Databases

QPIAD

Solves the problem by generating rewritten queries according to a set of mined attribute correlation rules.

Approximate Functional Dependency (AFD)

Naïve Bayesian Classifier

Page 7: Query Processing over Incomplete Autonomous Databases

QPIAD Solution

Page 8: Query Processing over Incomplete Autonomous Databases

QPIAD Architecture

Page 9: Query Processing over Incomplete Autonomous Databases

Overall Process

1. Learn

2. Rewrite

3. Rank

4. Explain

Page 10: Query Processing over Incomplete Autonomous Databases

#1. Learn - AFD

Learn Attribute Correlations

Approximate Functional Dependencies (AFD)

Approximate Keys (AKeys)

– For pruning

Learned with the TANE algorithm

Y. Huhtala et al. Efficient discovery of functional and approximate dependencies using partitions. 1998.

Pruning example

AFD {A1, A2} ~> A3

Akey {A1}
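A minimal sketch of how AFD and AKey confidence could be measured on a sample, and how the pruning example might be read. It does not reproduce TANE itself: the confidence definitions below (fraction of tuples kept after removing the minimum number of violators) and the rule "drop an AFD whose determining set contains an AKey" are assumptions made for illustration, as are the function names and the toy rows.

```python
from collections import Counter, defaultdict

# Invented sample rows for illustration only.
ROWS = [
    {"A1": 1, "A2": "x", "A3": "p"},
    {"A1": 2, "A2": "x", "A3": "p"},
    {"A1": 3, "A2": "y", "A3": "q"},
    {"A1": 4, "A2": "y", "A3": "p"},
]


def afd_confidence(rows, lhs, rhs):
    """Fraction of rows kept if each group of equal lhs values keeps only its most common rhs value."""
    groups = defaultdict(Counter)
    for r in rows:
        groups[tuple(r[a] for a in lhs)][r[rhs]] += 1
    kept = sum(c.most_common(1)[0][1] for c in groups.values())
    return kept / len(rows)


def akey_confidence(rows, attrs):
    """Fraction of rows kept if duplicates on attrs are removed (1.0 means an exact key)."""
    distinct = {tuple(r[a] for a in attrs) for r in rows}
    return len(distinct) / len(rows)


def prune_afds(afds, akeys):
    """Drop AFDs whose determining set contains an approximate key (assumed pruning rule)."""
    return [
        (lhs, rhs)
        for lhs, rhs in afds
        if not any(set(key) <= set(lhs) for key in akeys)
    ]


afds = [(("A1", "A2"), "A3"), (("A2",), "A3")]
kept = prune_afds(afds, akeys=[("A1",)])
```

Under this reading, {A1, A2} ~> A3 is pruned because A1 alone almost identifies each tuple, so the dependency holds trivially and says little about how A3 is distributed.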

Page 11: Query Processing over Incomplete Autonomous Databases

#1. Learn - Naïve Bayesian Classifier

Learn value distributions with a Naïve Bayesian Classifier (NBC)

Use the determining set of a mined AFD as the selected features

E.g.

– AFD {Make, Body} ~> Model

– P(Model = Accord | Make = Honda, Body = Coupe) = ?
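The probability in the example can be estimated with a small Naïve Bayes classifier restricted to the attributes in the AFD's determining set. The sketch below is an assumption-laden illustration, not the system's implementation: the training rows are invented, add-one (Laplace) smoothing is a choice made here, and `train_nbc` and `predict` are hypothetical names.

```python
from collections import Counter, defaultdict

# Invented complete tuples used as training data.
TRAIN = [
    {"make": "Honda", "body": "Coupe", "model": "Accord"},
    {"make": "Honda", "body": "Coupe", "model": "Accord"},
    {"make": "Honda", "body": "Sedan", "model": "Civic"},
    {"make": "Honda", "body": "Coupe", "model": "Civic"},
    {"make": "Toyota", "body": "Sedan", "model": "Camry"},
]


def train_nbc(rows, features, target):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(r[target] for r in rows)
    value_counts = defaultdict(Counter)  # (feature, class) -> Counter of values
    domains = defaultdict(set)           # feature -> set of observed values
    for r in rows:
        for f in features:
            value_counts[(f, r[target])][r[f]] += 1
            domains[f].add(r[f])
    return class_counts, value_counts, domains


def predict(model, evidence):
    """Return a normalized distribution over target values given the evidence."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # prior P(class)
        for f, v in evidence.items():
            # Add-one smoothing so unseen values do not zero out the product.
            score *= (value_counts[(f, c)][v] + 1) / (n_c + len(domains[f]))
        scores[c] = score
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}


# Features come from the determining set of AFD {Make, Body} ~> Model.
nbc = train_nbc(TRAIN, features=["make", "body"], target="model")
dist = predict(nbc, {"make": "Honda", "body": "Coupe"})
```

On this toy data `dist["Accord"]` is the largest entry, so a tuple with Make = Honda, Body = Coupe and a missing Model would most likely be predicted as an Accord.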

Page 12: Query Processing over Incomplete Autonomous Databases

#1. Learn - Selectivity

Estimated selectivity of a rewritten query Q on relation R: SmplSel(Q) * SmplRatio(R) * PerInc(R)

SmplSel(Q) = selectivity of the rewritten query when issued on the sample

SmplRatio(R) = ratio of the original database size to the sample size

PerInc(R) = percentage of incomplete tuples encountered while creating the sample
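As a worked example of the product above, the sketch below treats SmplSel(Q) as the number of sample tuples matching the rewritten query and PerInc(R) as a fraction between 0 and 1. Both interpretations, the function name, and the numbers are assumptions for illustration.

```python
def estimate_selectivity(smpl_sel, smpl_ratio, per_inc):
    """Estimate how many incomplete tuples a rewritten query retrieves from the full source."""
    return smpl_sel * smpl_ratio * per_inc


# Hypothetical numbers: 20 matching tuples on the sample, a database
# 10 times larger than the sample, and 5% incomplete tuples.
estimate = estimate_selectivity(smpl_sel=20, smpl_ratio=10, per_inc=0.05)
```

With these numbers the estimate is about 10 incomplete tuples, which is the quantity later used as the recall side of the ranking.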

Page 13: Query Processing over Incomplete Autonomous Databases

#2. Rewrite

1. Get the base result (certain answers)

2. Generate rewritten queries from the base result and the learned AFDs
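The two steps can be sketched as follows for the query Body Style = Convt, assuming a learned AFD Model ~> Body Style. The rows, the dict representation of a rewritten query, and the function names are hypothetical. The null check is done on the mediator side after retrieval, since the source is assumed not to accept null bindings.

```python
# Invented source contents; None stands for a null value.
SOURCE = [
    {"make": "BMW", "model": "Z4", "body": "Convt"},
    {"make": "Audi", "model": "A4", "body": "Convt"},
    {"make": "BMW", "model": "Z4", "body": None},
    {"make": "Audi", "model": "A4", "body": None},
    {"make": "Honda", "model": "Civic", "body": None},
]


def base_result(rows, attr, value):
    """Step 1: certain answers, i.e. tuples that exactly satisfy the constraint."""
    return [r for r in rows if r[attr] == value]


def rewrite_queries(base, determining_set):
    """Step 2: one rewritten query per distinct determining-set value combination in the base result."""
    rewritten = []
    for row in base:
        combo = {a: row[a] for a in determining_set}
        if None in combo.values():
            continue  # cannot bind a query on a missing value
        if combo not in rewritten:
            rewritten.append(combo)
    return rewritten


def uncertain_answers(rows, query, constrained_attr):
    """Issue a rewritten query and keep only tuples whose constrained attribute is null."""
    return [
        r for r in rows
        if all(r[a] == v for a, v in query.items()) and r[constrained_attr] is None
    ]


base = base_result(SOURCE, "body", "Convt")
queries = rewrite_queries(base, determining_set=["model"])
candidates = [t for q in queries for t in uncertain_answers(SOURCE, q, "body")]
```

Here the rewritten queries are Model = Z4 and Model = A4. They retrieve the two tuples with a missing body style that are likely convertibles, and never touch the Civic.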

Figure: Rewritten Queries

Page 14: Query Processing over Incomplete Autonomous Databases

#3. Rank

1. Select top-k queries based on F-Measure

2. Reorder the selected queries based on P

3. Retrieve tuples

P = learned probability, R = selectivity
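A sketch of the selection and reordering steps, assuming the weighted F-measure takes the form (1 + alpha) * P * R / (alpha * P + R) and that R has already been normalized to the range 0 to 1. The formula variant, the normalization, the function names, and the candidate queries are assumptions for illustration.

```python
def f_measure(p, r, alpha):
    """Weighted F-measure; alpha = 0 reduces to precision, larger alpha favors recall."""
    denom = alpha * p + r
    if denom == 0:
        return 0.0
    return (1 + alpha) * p * r / denom


def order_rewritten_queries(queries, k, alpha):
    """Pick the top-k queries by F-measure, then issue them in order of decreasing precision."""
    top_k = sorted(
        queries, key=lambda q: f_measure(q["p"], q["r"], alpha), reverse=True
    )[:k]
    return sorted(top_k, key=lambda q: q["p"], reverse=True)


# Hypothetical rewritten queries with estimated precision p and normalized selectivity r.
CANDIDATES = [
    {"name": "Q1", "p": 0.9, "r": 0.1},
    {"name": "Q2", "p": 0.6, "r": 0.5},
    {"name": "Q3", "p": 0.3, "r": 0.4},
]

balanced = [q["name"] for q in order_rewritten_queries(CANDIDATES, k=2, alpha=1)]
precise = [q["name"] for q in order_rewritten_queries(CANDIDATES, k=2, alpha=0)]
```

With alpha = 1 the highly precise but unproductive Q1 is dropped in favor of Q2 and Q3, while alpha = 0 selects purely on precision and keeps Q1 and Q2. Issuing the selected queries in decreasing order of P lets the mediator return the most likely answers first.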

Page 15: Query Processing over Incomplete Autonomous Databases

#4. Explain

Page 16: Query Processing over Incomplete Autonomous Databases

Other Issues: Correlated Source

Page 17: Query Processing over Incomplete Autonomous Databases

Other Issues: Handling Aggregation

Page 18: Query Processing over Incomplete Autonomous Databases

Empirical Evaluation: Quality

QPIAD vs. ALLRETURNED

ALLRETURNED has low precision because not all tuples with missing values on the constrained attributes are relevant to the query

QPIAD has a much higher precision than ALLRETURNED as it aims to retrieve tuples with missing values on the constrained attributes which are very likely to be relevant to the query

Page 19: Query Processing over Incomplete Autonomous Databases

Empirical Evaluation: Efficiency

QPIAD vs. ALLRANKED

The ALLRANKED approach is often infeasible because sources rarely allow direct retrieval of tuples with null values

QPIAD achieves the same level of recall as ALLRANKED while retrieving far fewer tuples

Page 20: Query Processing over Incomplete Autonomous Databases

Empirical Evaluation: Robustness

Robustness w.r.t. Sample Size

QPIAD is robust even when faced with a relatively small data sample

Page 21: Query Processing over Incomplete Autonomous Databases

Empirical Evaluation: Extensions

Aggregates

Prediction of missing values increases the fraction of queries that achieve higher levels of accuracy

Approximately 20% more queries achieve 100% accuracy when prediction is used

Join

As alpha is increased, we obtain a higher recall without sacrificing much precision

Page 22: Query Processing over Incomplete Autonomous Databases

Related Work

Querying Incomplete Databases

– Possible World Approaches: track the completions of incomplete tuples (Codd Tables, V-Tables, Conditional Tables)

– Probabilistic Approaches: quantify the distribution over completions to distinguish between the likelihood of various possible answers

Probabilistic Databases

– Each tuple is associated with an attribute describing the probability of its existence

– However, in our work, the mediator does not have the capability to modify the underlying autonomous databases

Query Reformulation / Relaxation

– Aims to return similar or approximate answers to the user after returning, or in the absence of, exact answers

– Our focus is on retrieving tuples with missing values on constrained attributes

Learning Missing Values

– Common imputation approaches replace missing values by substituting the mean, most common value, or default value, or by using kNN, association rules, etc.

– Our work requires schema-level dependencies between attributes as well as distribution information over missing values

Page 23: Query Processing over Incomplete Autonomous Databases

Contribution

Efficiently retrieve relevant uncertain answers from autonomous sources given only limited query access patterns

– Query Rewriting

Retrieves answers with missing values on constrained attributes without modifying the underlying databases

– AFD-Enhanced Classifiers

Rewriting & ranking considers the natural tension between precision and recall

– F-Measure based ranking

AFDs play a major role in:

– Query Rewriting

– Feature Selection

– Explanations
