Query Translation of Web Database Integration: Issues,
Advances and Directions
Fangjiao Jiang
Outline
- Query translation in web database integration: introduction, problems, a simple framework
- Survey of the current work
- The challenges and opportunities of query translation in Web DB integration
- Our future work on query translation
Introduction of query translation
[Figure: architecture of a web database integration system. The Integrated Interface Generation Module starts from the World Wide Web and covers Web Database Discovery (producing the WDB list), WDB Interface Schema Extraction, WDB Clustering by Domain (WDB Cluster 1 … n), and Interface Integration (Integrated Interface 1 … n). The Query Processing Module takes a user query on an integrated interface, works against WDB 1 … WDB m, and covers Database Selection, Domain Mapping, Query Dispatch, Result Extraction, Entity Identification, Result Annotation, and Result Merging before returning the result. A label marks query translation in the figure.]
Query translation: a user's query submitted to the integrated interface must be translated automatically into queries on the individual web database interfaces.
Problems
Problem 1: Should we translate the query to every web database?
- Is it necessary? Is it costly? Is it redundant?
- What? Which web databases should we select for translating the user's query? (Database selection)
[Figure: the query Q is sent to each of the web databases DB1, DB2, DB3, ……, DBn.]
Problems
Problem 2: How? How do we translate a query from the integrated interface to a web database interface? (Attribute matching)
Matches between the integrated interface and a local web DB interface can be 1:1 or complex:
- {Depart City} = {leaving from} (1:1)
- {Destination} = {Going to} (1:1)
- ……
- {Adult, Child} = {Passengers} (n:1)
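A minimal sketch of applying such matches once they are known. The names `MATCHES` and `translate_attributes`, the example city values, and the choice to combine the n:1 match {Adult, Child} by adding the counts are illustrative assumptions, not part of the slides.

```python
# Hypothetical matches between the integrated interface and one local interface.
# Each entry: (integrated attributes, local attribute, function combining the values).
MATCHES = [
    (("Depart City",), "leaving from", lambda values: values[0]),  # 1:1
    (("Destination",), "Going to", lambda values: values[0]),      # 1:1
    (("Adult", "Child"), "Passengers", sum),                       # n:1, assumed to add up
]

def translate_attributes(query):
    """Rename (and, for n:1 matches, combine) the attributes of a query."""
    local_query = {}
    for sources, target, combine in MATCHES:
        # Only translate a match when all of its integrated attributes are filled in.
        if all(name in query for name in sources):
            local_query[target] = combine([query[name] for name in sources])
    return local_query

user_query = {"Depart City": "Beijing", "Destination": "Shanghai", "Adult": 2, "Child": 1}
local = translate_attributes(user_query)
print(local)
```

Keeping the combiner next to each match makes the difference between 1:1 and n:1 matches explicit: a 1:1 match only renames, while an n:1 match also has to merge values.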
Problems
Problem 3: How? How do we translate a query from the integrated interface to a web database interface? (Constraint mapping)
- Title contains "red storm" on the integrated interface may become Title contains "red storm" (any words), or Title contains "red storm" (all words), or Title contains "storm" (any words) on a local interface.
- Price < $35 on the integrated interface may become Price < $25 ∪ $25 < Price < $45 on a local interface.
- ……
[Figure: the integrated interface next to local web DB interface 1 and local web DB interface 2.]
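The price example can be sketched as picking, from the fixed ranges a local interface offers, exactly those that overlap the range the user asked for. This is only an illustration: `map_range` and the three local price options are assumptions, and ranges are treated as open intervals as on the slide.

```python
import math

def map_range(source, target_ranges):
    """Return the target ranges that overlap the source range.

    Ranges are (low, high) open intervals. Every kept range contains some
    value the user asked for, and every dropped range contains none, so the
    union of the kept ranges is the tightest cover these options allow.
    """
    low, high = source
    return [(a, b) for (a, b) in target_ranges if a < high and b > low]

# Hypothetical price options on a local interface: under 25, 25 to 45, over 45.
local_options = [(0, 25), (25, 45), (45, math.inf)]

chosen = map_range((0, 35), local_options)  # integrated query: Price < 35
print(chosen)
```

The translation returns more than the user asked for (prices from 35 to 45), which is why a later filtering step on the returned results is needed.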
Problems
Problem 4: How? How do we translate a query from the integrated interface to a web database interface? (Capability-based query rewriting)
- On the integrated interface, Author and Title can be queried together. On a local web DB interface, only one of Author, Title, and Subject can be queried at a time.
- Class = Economy: no attribute in the web database interface matches the attribute "Class", so should the constraint be treated as Class = true?
[Figure: the integrated interface next to a local web DB interface, for each of the two cases.]
Problems
Problem 5: It is unavoidable that some of the returned results are incorrect.
How? How do we filter the returned results to keep only the correct ones? (Result filter)
[Figure: the correct results are a subset of the returned results.]
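A minimal sketch of such a filter, assuming the returned records have already been extracted into dictionaries. The function name, the record fields, and the sample records are all hypothetical; the idea is simply to re-check the original query on the integrated side.

```python
def filter_results(results, predicates):
    """Keep only the records that satisfy every predicate of the original query."""
    return [record for record in results if all(p(record) for p in predicates)]

# Original query on the integrated interface (illustrative):
# Title contains the phrase "red storm" AND Price < 35.
original_query = [
    lambda r: "red storm" in r["title"].lower(),
    lambda r: r["price"] < 35,
]

# Hypothetical records returned by a local source for a relaxed translation
# such as Title contains "storm" and Price < 45.
returned = [
    {"title": "Red Storm Rising", "price": 30},
    {"title": "Storm Front", "price": 20},
    {"title": "Red Storm Rising", "price": 40},
]

correct = filter_results(returned, original_query)
print(correct)
```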
The simple framework of query translation
- Pre-processing: database selection
- Core processing: attribute matching, query rewriting, constraint mapping
- Post-processing: result filter
Survey the current work (Ⅰ)
Database selection
- Zaiqing Nie and Subbarao Kambhampati. A Frequency-based Approach for Mining Coverage Statistics in Data Integration. In Proceedings of the 20th ICDE, 2004.
- A. Y. Levy, A. Rajaraman, and J. J. Ordille. Querying heterogeneous information sources using source descriptions. In VLDB Conference, 1996.
- A. Y. Halevy. Answering queries using views: A survey. The VLDB Journal, 10(4):270–294, 2001.
- ……
Attribute matching
- E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. VLDB Journal, 10(4):334–350, 2001.
- Bin He. Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach. KDD, 2004.
- B. He and K. C.-C. Chang. Statistical schema matching across web query interfaces. In SIGMOD Conference, 2003.
- Jiying Wang, Ji-Rong Wen, Fred Lochovsky, Wei-Ying Ma. Instance-based Schema Matching for Web Databases by Domain-specific Query Probing. Proceedings of the 30th VLDB Conference, 2004.
- Bin He, Kevin Chen-Chuan Chang. Making holistic schema matching robust: an ensemble approach. KDD 2005, 429-438.
- ……
Survey the current work (Ⅱ)
Constraint mapping
- Z. Zhang, B. He, and K. C.-C. Chang. Light-weight domain-based form assistant: Querying Web Databases On the Fly. In VLDB Conference, 2005.
- K. C.-C. Chang and H. Garcia-Molina. Approximate query mapping: Accounting for translation closeness. VLDB Journal, 2001.
- K. C.-C. Chang, H. Garcia-Molina, and A. Paepcke. Boolean Query Mapping Across Heterogeneous Information Sources. IEEE Transactions on Knowledge and Data Engineering, 1996.
- K. C.-C. Chang, H. Garcia-Molina. Mind Your Vocabulary: Query Mapping Across Heterogeneous Information Sources. Proceedings of the 1999 ACM SIGMOD Conference.
- ……
Query rewriting
- C. Li, R. Yerneni, V. Vassalos, H. Garcia-Molina, Y. Papakonstantinou, J. Ullman, and M. Valiveti. Capability based mediation in TSIMMIS. SIGMOD Conference, 1998.
- Y. Papakonstantinou, A. Gupta, and L. Haas. Capabilities-based query rewriting in mediator systems. In International Conference on Parallel and Distributed Information Systems, 1996.
- Y. Papakonstantinou, A. Gupta, H. Garcia-Molina, and J. D. Ullman. A query translation scheme for rapid implementation of wrappers. In International Conference on Deductive and Object-Oriented Databases, 1995.
- ……
Result filter
- ……
Related works (1): Database selection
A Frequency-based Approach for Mining Coverage Statistics in Data Integration (ICDE 2004)
BibFinder scenario: a publicly fielded computer science bibliography mediator. It integrates several online computer science bibliography sources, such as CSB, DBLP, ACM Digital Library, CiteSeer, and so on.
Approach: use coverage and overlap statistics to rank sources.
- Ranking needs the coverage of each source S with respect to the query Q.
- Learn statistics only with respect to a smaller set of frequently asked queries, so that only the coverage of each source S with respect to a frequent query Q is needed.
- Store statistics with respect to query classes. A new query that is not in the query list can be mapped to its most similar query classes.
- Query classes are built from AV hierarchies.
Query list
- Query: Author="andy king"; frequency: 106; distinct answers: 46
  Overlap (coverage): DBLP 35; CSB 23; CSB, DBLP 12; DBLP, Science 3; Science 3; CSB, DBLP, Science 1; CSB, Science 1
- Query: Author="fayyad", Title="data mining"; frequency: 1; distinct answers: 27
  Overlap (coverage): CSB 16; DBLP 16; CSB, DBLP 7; ACMdl 5; ACMdl, CSB 3; ACMdl, DBLP 3; ACMdl, CSB, DBLP 2; Science 1
Query List: the mediator maintains an XML log of all user queries, along with their access frequency, number of total distinct answers obtained, and number of answers from each source set which has answers for the query.
AV Hierarchies and Query Classes
Attribute-Value Hierarchy: an AV hierarchy is a classification of the values of a particular attribute of the mediator relation. Leaf nodes in the hierarchy correspond to concrete values bound in a query.
Query class: queries are grouped into classes by computing Cartesian products over the AV hierarchies. A query class is a set of queries that all share a set of assignments of particular attributes to specific values.
[Figure: AV hierarchy for the Year attribute: root RT with leaves 2001 and 2002. AV hierarchy for the Conference attribute: root RT with children DB (leaves SIGMOD, ICDE) and AI (leaves AAAI, ECP). Query class hierarchy: pairs of one conference node and one year node, from (RT,RT) through classes such as (DB,RT), (AI,01), and (SIGMOD,RT) down to leaf classes such as SIGMOD01, ICDE01, ICDE02, AAAI01, and ECP01.]
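The Cartesian-product construction can be sketched as follows. The two hierarchies are our reading of the figure labels (RT as root, DB and AI as conference groups), and the helper names are hypothetical.

```python
from itertools import product

# AV hierarchies as child -> parent maps; "RT" is the root of each hierarchy.
CONFERENCE = {"DB": "RT", "AI": "RT", "SIGMOD": "DB", "ICDE": "DB", "AAAI": "AI", "ECP": "AI"}
YEAR = {"2001": "RT", "2002": "RT"}

def nodes(hierarchy):
    """All nodes of a hierarchy, including the root."""
    return {"RT"} | set(hierarchy)

def ancestors(value, hierarchy):
    """The value itself followed by its ancestors up to the root."""
    chain = [value]
    while chain[-1] != "RT":
        chain.append(hierarchy[chain[-1]])
    return chain

# Every query class is one conference node paired with one year node.
query_classes = set(product(nodes(CONFERENCE), nodes(YEAR)))

def classes_covering(conference, year):
    """Query classes that contain the query Conference=conference, Year=year."""
    return set(product(ancestors(conference, CONFERENCE), ancestors(year, YEAR)))

print(len(query_classes))
print(sorted(classes_covering("SIGMOD", "2001")))
```

A query with concrete values sits in a leaf class, and every combination of its ancestors gives a more general class whose statistics can stand in for it.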
Using Coverage and Overlap Statistics to Rank Sources
1. A new user query is mapped to a set of least general query classes.
2. The mediator estimates the statistics for the query using a weighted sum of the statistics of the mapped classes.
3. Data sources are ranked and called in order of relevance using the estimated statistics. In particular:
- The most relevant source has the highest coverage.
- The next best source has the highest residual coverage.
As a result, the maximum number of tuples is obtained while the fewest sources are called.
[Figure: overlapping answer sets of DBLP, CSB, and ACMDL.]
Example: here, CSB has the highest coverage, followed by DBLP. However, since ACMDL has higher residual coverage than DBLP, the top two sources that would be called are CSB and ACMDL.
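The ranking step can be sketched as a greedy loop that always calls the source with the highest residual coverage. The numbers below are the overlap figures listed for Author="andy king" in the query list, read as the number of answers shared by all sources in a set (our reading of the flattened table). The function names and the inclusion-exclusion formulation are assumptions for illustration, not the paper's algorithm as published.

```python
from itertools import combinations

# Answers shared by every source in the key (overlap statistics).
STATS = {
    frozenset({"DBLP"}): 35,
    frozenset({"CSB"}): 23,
    frozenset({"Science"}): 3,
    frozenset({"CSB", "DBLP"}): 12,
    frozenset({"DBLP", "Science"}): 3,
    frozenset({"CSB", "Science"}): 1,
    frozenset({"CSB", "DBLP", "Science"}): 1,
}

def residual_coverage(source, called):
    """Answers `source` adds beyond the already called sources (inclusion-exclusion)."""
    residual = STATS.get(frozenset({source}), 0)
    for size in range(1, len(called) + 1):
        for subset in combinations(called, size):
            shared = STATS.get(frozenset(subset) | {source}, 0)
            residual += (-1) ** size * shared
    return residual

def rank_sources(sources):
    """Greedy plan: repeatedly pick the source with the highest residual coverage."""
    called, plan = [], []
    remaining = set(sources)
    while remaining:
        # sorted() makes ties deterministic.
        best = max(sorted(remaining), key=lambda s: residual_coverage(s, called))
        plan.append((best, residual_coverage(best, called)))
        called.append(best)
        remaining.remove(best)
    return plan

plan = rank_sources(["CSB", "DBLP", "Science"])
print(plan)
```

With these figures the residual coverages add up to 46, the number of distinct answers recorded for the query, and the third source adds nothing new, so two calls are enough.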
Related works (2): Database selection
Querying heterogeneous information sources using source descriptions (VLDB 1996)
- Information Manifold (IM): an implemented system that provides uniform access to a heterogeneous collection of more than 100 information sources on the WWW.
- IM contains declarative descriptions of the contents of the information sources.
- Example: Q: Get the price and review of cars for sale that were manufactured no earlier than 1992.
- IM uses the relational model, augmented with certain object-oriented features, to describe the contents of information sources.
Related works (3): Constraint mapping
Light-weight domain-based form assistant: Querying Web Databases On the Fly (VLDB 2005)
Semantic closeness
Definition 1: Given a source query Qs and a target query form T, a query Qt* is a minimal subsuming translation w.r.t. T if:
1. Qt* is a valid query w.r.t. T;
2. Qt* subsumes Qs, i.e., for any database instance Di, Qs(Di) ⊆ Qt*(Di);
3. Qt* is minimal, i.e., there is no query Qt such that Qt satisfies (1) and (2) and Qt* subsumes Qt.
Approach:
- The authors find 37 template patterns in 150 sources.
- Two predicate templates have a mapping correspondence only if some concept is expressed with these two templates in different sources. In the correspondence matrix CM, CM(i, j) denotes the number of concepts that are expressed using both templates Pi and Pj.
- As Figure 5 indicates, mappings occur mostly within certain clusters of templates: Datetime type, Numeric type, and Text type.
- The predicate mapper consists of two components: a type recognizer and type handlers.
- The predicate mapper takes a source predicate s and a matched target predicate template P as input, and outputs the closest target translation t* for s.
- A type handler needs three key components: search space, closeness estimation, and search strategy.
Closeness estimation
- Given the search space (P) covering all possible mappings, finding a Cmin mapping for the numeric and datetime types is an easy task.
- For the text type, inferring the subsumption relationship is not trivial, since it essentially requires logical reasoning.
Text type handler
- The question is which database instance can be used to reliably test the subsumption relationship.
- The database is constructed using words from Ws plus some additional random words. It is composed of all possible combinations of the words (for testing membership) in all possible orders (for testing sequence).
- In Figure 8, t5 is the Cmin mapping.
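The text-type test can be sketched as below, under several assumptions: records are word tuples, the source predicate is the exact phrase "red storm", the candidate operators are "all words" and "any words", and "sky" plays the role of the additional random word. Picking the subsuming candidate with the fewest answers on the test database is used here as a stand-in for minimality; none of the function names come from the paper.

```python
from itertools import permutations

def build_test_database(words):
    """Every ordering of every non-empty subset of `words`, as word tuples."""
    return [p for n in range(1, len(words) + 1) for p in permutations(words, n)]

def phrase(*terms):
    """Matches records that contain the terms consecutively, in order."""
    def match(record):
        n = len(terms)
        return any(record[i:i + n] == terms for i in range(len(record) - n + 1))
    return match

def all_words(*terms):
    return lambda record: all(t in record for t in terms)

def any_word(*terms):
    return lambda record: any(t in record for t in terms)

def answers(predicate, db):
    return {record for record in db if predicate(record)}

def closest_subsuming(source, candidates, db):
    """Name of the candidate that subsumes `source` with the fewest answers."""
    src = answers(source, db)
    subsuming = {}
    for name, predicate in candidates.items():
        result = answers(predicate, db)
        if src <= result:  # candidate returns every answer of the source
            subsuming[name] = result
    return min(subsuming, key=lambda name: len(subsuming[name])) if subsuming else None

db = build_test_database(["red", "storm", "sky"])
candidates = {
    '"red storm" (any words)': any_word("red", "storm"),
    '"red storm" (all words)': all_words("red", "storm"),
    '"storm" (any words)': any_word("storm"),
}
best = closest_subsuming(phrase("red", "storm"), candidates, db)
print(best)
```

Because the database contains every combination and every order of the words, a candidate that misses some answer of the source predicate is caught by at least one record.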
Related works (4): Constraint mapping
Mind Your Vocabulary: Query Mapping Across Heterogeneous Information Sources (SIGMOD 1999)
Method: provide manually written mapping rules to translate query constraints from one source to another.
- Consider one-to-one constraint mappings.
- Consider inter-dependencies among constraints.
- Explore relaxations to the closest supported version.
Related works (5): Query rewriting
Querying heterogeneous information sources using source descriptions (VLDB 1996)
- Information Manifold (IM): an implemented system that provides uniform access to a heterogeneous collection of more than 100 information sources on the WWW.
- IM contains declarative descriptions of the contents and capabilities of the information sources.
- It uses the source descriptions to prune the set of information sources for a given query and to generate executable query plans.
- Capability records describe the capabilities of an information source. Every source relation is associated with one capability record of the form (Sin, Sout, Ssel, min, max):
  - Sin: the input parameters; bindings must be given for at least min elements of Sin.
  - Sout: the parameters returned from the information source.
  - Ssel: the parameters on which the source can apply selections.
  - min: the minimum number of inputs allowed.
  - max: the maximum number of inputs allowed.
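A capability record of this form can be sketched as a small data class with a check for whether a query is answerable. This is a simplified illustration: the class, the `accepts` method, and the example source are assumptions, and selections on Ssel are stored but not checked.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CapabilityRecord:
    s_in: frozenset      # parameters that may be bound as inputs
    s_out: frozenset     # parameters the source returns
    s_sel: frozenset     # parameters the source can apply selections on
    min_inputs: int      # minimum number of inputs allowed
    max_inputs: int      # maximum number of inputs allowed

    def accepts(self, bound, wanted):
        """True if a query binding `bound` and asking for `wanted` is answerable."""
        bound, wanted = frozenset(bound), frozenset(wanted)
        return (
            bound <= self.s_in                                   # only legal inputs
            and self.min_inputs <= len(bound) <= self.max_inputs  # right number of them
            and wanted <= self.s_out                             # source returns what is asked
        )

# Hypothetical source that accepts exactly one of Author, Title, Subject at a time.
books = CapabilityRecord(
    s_in=frozenset({"author", "title", "subject"}),
    s_out=frozenset({"author", "title", "subject", "price"}),
    s_sel=frozenset({"price"}),
    min_inputs=1,
    max_inputs=1,
)

one_input = books.accepts({"author"}, {"title", "price"})
two_inputs = books.accepts({"author", "title"}, {"price"})
print(one_input, two_inputs)
```

A query that binds both author and title is rejected by this source, which is the situation that capability-based rewriting has to work around, for example by sending one binding and checking the other on the results.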
Related works (6): Query rewriting
Capability Based Mediation in TSIMMIS (SIGMOD 1998)
The mediator keeps track of the capabilities of sources to answer queries, so that it does not generate plans involving source queries that the sources cannot answer.
TSIMMIS system:
- The mediator encodes the relationship between the user views and the source views with a set of view definitions.
- It uses the Mediator Specification Language (MSL) to define user views. MSL is a logic-based language with object-oriented features.
For example, the user view paper is defined as follows:
<paper {<title T><author A><abs B><conf C>}> :- <entry {<title T><author A><abs B>}>@s1, <entry {<title T><conf C>}>@s2
Suppose the user wants to find the title and abstract of each paper written by 'Smith' in 'SIGMOD-97'. The user formulates the following query, based on the user view paper:
<ans {<title T><abs B>}> :- <paper {<title T><author 'Smith'><abs B><conf 'SIGMOD-97'>}>
When the user query arrives at the mediator, the mediator uses the view definitions to translate the query on the user views into a logical plan. The following is the logical plan for the example user query:
<ans {<title T><abs B>}> :- <entry {<title T><author 'Smith'><abs B>}>@s1, <entry {<title T><conf 'SIGMOD-97'>}>@s2
Three possible physical plans for the logical plan of the example user query are:
P1: Send query <entry {<title T><author 'Smith'><abs B>}> to s1; send query <entry {<title T><conf 'SIGMOD-97'>}> to s2; join the results of these source queries on the title attribute.
P2: Send query <entry {<title T><author 'Smith'><abs B>}> to s1; for each returned title, send query <entry {<title T><conf 'SIGMOD-97'>}> to s2, with T bound.
P3: Send query <entry {<title T><conf 'SIGMOD-97'>}> to s2; for each returned title, send query <entry {<title T><author 'Smith'><abs B>}> to s1, with T bound.
Some of these physical plans may not be feasible, depending on the query capabilities of the sources.
To describe the capabilities of sources, the TSIMMIS system uses templates to represent sets of queries that can be processed by each source.
Suppose s1 and s2 only have the following templates:
T11: X :- X:<entry {<title $T><author A><abs B>}>@s1
T21: X :- X:<entry {<title T><conf $C>}>@s2
T22: X :- X:<entry {<title $T><conf C>}>@s2
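A rough way to check the three plans against the templates is sketched below. It rests on a simplified reading that is our assumption, not the paper's algorithm: a template is reduced to the attributes marked with $ (which must be bound when the query is sent), and any further constant in a step, such as the author, is assumed to be checked at the mediator after the results come back.

```python
# Each template: (source, attributes that must be bound, i.e. marked with $).
TEMPLATES = {
    "T11": ("s1", {"title"}),
    "T21": ("s2", {"conf"}),
    "T22": ("s2", {"title"}),
}

# Each plan: ordered steps (source, attributes bound when the query is sent).
PLANS = {
    "P1": [("s1", {"author"}), ("s2", {"conf"})],
    "P2": [("s1", {"author"}), ("s2", {"title", "conf"})],
    "P3": [("s2", {"conf"}), ("s1", {"title", "author"})],
}

def step_supported(source, bound):
    """A step is supported if some template of that source has all its $ attributes bound."""
    return any(src == source and required <= bound for src, required in TEMPLATES.values())

def plan_feasible(steps):
    """A plan is feasible only if every one of its source queries is supported."""
    return all(step_supported(source, bound) for source, bound in steps)

feasible = {name: plan_feasible(steps) for name, steps in PLANS.items()}
print(feasible)
```

Under this reading, P1 and P2 fail at their first step because s1 needs the title to be bound, while P3 passes: s2 is queried with the conference bound, and each returned title is then sent to s1.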
Challenges
How can we translate the query from the uniform integrated interface to web database interfaces?
- The number of web databases we can access is very large, even within one domain.
- The meta-information about the web databases is very difficult to access:
  - Logical source contents (books, new cars)
  - Source capabilities (can answer the query)
  - Source completeness (has all books)
  - Statistics about the data (as in an RDBMS)
  - Source reliability
  - Update frequency
- The web databases are heterogeneous, with heterogeneous schemas.
- The web databases are autonomous: no central administration, uncontrolled overlap of source content.
- The web databases are dynamic.
- Approximate query translations will be unavoidable and complex.
- Manual rule-based constraint mapping will be replaced by automatic rule-based constraint mapping.
Opportunities
- The aggregate schema vocabulary of sources in the same domain tends to converge at a relatively small size.
- The distribution of attribute frequencies is non-uniform and Zipf-like.
- There are 25 constraint patterns overall.
- The distribution of constraint patterns is Zipf-like, too.
- The data model is simple.
- Related work on schema matching should be useful for query translation.
Our framework of query translation
- Database selection: describe the contents, completeness, capabilities, and reliability of each web database.
- Attribute matching: create a hierarchy relationship with respect to semantic matching.
- Query rewriting: describe the capabilities of each web database.
- Constraint mapping: generate mapping rules automatically according to the types.
- Result filter
[Figure labels: probing queries, statistics information, semantic matching, generate mapping rules, description of capabilities.]
Questions
- How do we characterize, obtain, and exploit source content, completeness, reliability, coverage, and overlap?
- How do we create a hierarchy relationship with respect to the semantic mapping of attributes?
- How do we generate constraint mapping rules automatically according to the semantic mapping and the types of attributes?
- How do we obtain and describe the capabilities of the local interfaces, and how do we rewrite queries based on them?
Thanks!
Questions
Main references
- A. Y. Levy, A. Rajaraman, and J. J. Ordille. Querying heterogeneous information sources using source descriptions. In VLDB Conference, 1996.
- A. Y. Halevy. Answering queries using views: A survey. The VLDB Journal, 10(4):270–294, 2001.
- C. Li, R. Yerneni, V. Vassalos, H. Garcia-Molina, Y. Papakonstantinou, J. Ullman, and M. Valiveti. Capability based mediation in TSIMMIS. SIGMOD Conference, 1998.
- Y. Papakonstantinou, A. Gupta, and L. Haas. Capabilities-based query rewriting in mediator systems. In International Conference on Parallel and Distributed Information Systems, 1996.
- Y. Papakonstantinou, A. Gupta, H. Garcia-Molina, and J. D. Ullman. A query translation scheme for rapid implementation of wrappers. In International Conference on Deductive and Object-Oriented Databases, 1995.
- Vasilis Vassalos, Yannis Papakonstantinou. Describing and Using Query Capabilities of Heterogeneous Sources. VLDB 1997: 256-265.
- K. C.-C. Chang and H. Garcia-Molina. Approximate query mapping: Accounting for translation closeness. VLDB Journal, 2001.
- K. C.-C. Chang, H. Garcia-Molina, and A. Paepcke. Boolean Query Mapping Across Heterogeneous Information Sources. IEEE Transactions on Knowledge and Data Engineering, 1996.
- K. C.-C. Chang, H. Garcia-Molina. Mind Your Vocabulary: Query Mapping Across Heterogeneous Information Sources. Proceedings of the 1999 ACM SIGMOD Conference.
- Z. Zhang, B. He, and K. C.-C. Chang. Light-weight domain-based form assistant: Querying Web Databases On the Fly. In VLDB Conference, 2005.
- Zaiqing Nie and Subbarao Kambhampati. A Frequency-based Approach for Mining Coverage Statistics in Data Integration. In Proceedings of the 20th International Conference on Data Engineering (ICDE 2004).
- K. C.-C. Chang and H. Garcia-Molina. Conjunctive constraint mapping for data translation. ACM ICDL, 1998.
- Bin He, Kevin Chen-Chuan Chang. Making holistic schema matching robust: an ensemble approach. KDD 2005, 429-438.
- Wensheng Wu, AnHai Doan, Clement Yu. WebIQ: Learning from the Web to Match Deep-Web Query Interfaces. ICDE, 2006.
- B. He, K. C.-C. Chang, and J. Han. Automatic complex schema matching across web query interfaces: A correlation mining approach. Technical Report UIUCDCS-R-2003-2388, Dept. of Computer Science, UIUC, July 2003.
- K. C.-C. Chang, B. He, C. Li, M. Patel, and Z. Zhang. Structured databases on the web: Observations and implications. SIGMOD Record, 2004.
- L. Palopoli, D. Sacca, D. Ursino. Semi-automatic, semantic discovery of properties from database schemas. IDEAS, 1998.
- T. Milo, S. Zohar. Using schema matching to simplify heterogeneous data translation. Proc. 24th VLDB.
- Bin He. Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach. KDD, 2004.
- Jiying Wang, Ji-Rong Wen, Fred Lochovsky, Wei-Ying Ma. Instance-based Schema Matching for Web Databases by Domain-specific Query Probing. Proceedings of the 30th VLDB Conference, 2004.
- Hai He, Weiyi Meng, Clement Yu, and Zonghuan Wu. Automatic Integration of Web Search Interfaces with WISE-Integrator. VLDB 2003.
- ……