Querying Web Databases On The Fly. Authors: Z. Zhang, B. He, K. C.-C. Chang (Univ. of Illinois at Urbana-Champaign). Published in: Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005. Presented by: Bruce Vincent, CSE-718 Seminar, April 25, 2008.


Page 1: Light-weight Domain-based Form Assistant: Querying Web Databases On The Fly Authors:Z. Zhang, B. He, K. C.-C. Chang (Univ. of Illinois at Urbana-Champaign)

Light-weight Domain-based Form Assistant: Querying Web Databases On The Fly

Authors: Z. Zhang, B. He, K. C.-C. Chang (Univ. of Illinois at Urbana-Champaign)

Published in: Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005

Presented by: Bruce Vincent

CSE-718 Seminar

April 25, 2008

Page 2:

Outline

Overview
  Problem Description, Motivating Example
  System Architecture
Design Approaches
  Query Modeling and Translation
  Dynamic Predicate Mapping
Implementation - Form Assistant Toolkit
Experiments
Related Work

Page 3:

Problem Description

“Deep Web”
  Estimated to contain 450,000 online databases (2004)
  Sometimes referred to as the “Invisible Web” or “Hidden Web”
  Much of it is accessible only through query forms rather than static URL links
  Common domains include books, cars, and airfares

Page 4:

Problem Description

Often it is useful to query multiple alternative sources in the same domain
Automating this entails several components
One key component is dynamic query translation
The software toolkit “Form Assistant” is designed to provide potential translations of user queries for alternative sources
  e.g., a user-entered Amazon form query is automatically translated to a potential Barnes & Noble form query

Page 5:

Problem Description

Goals of the query translator:
  Source-generality
    Built-in translation must generally cope with new or “unseen” sources
  Domain-portability
    The translator must be easily customizable with domain-specific knowledge, and thus deployable for new domains

Page 6:

Motivating Example

Source query Qs on source form S (e.g., Amazon)
Target query form T (e.g., Barnes & Noble)

Page 7:

Motivating Example

Source query Qs on source form S; target query form T
(Figure: form screenshots with “Tom Clancy” entered)
Query Translation produces:
  Union Query Qt*: union (∪) of filled-in target forms (shown in figure)
  Filter: σ[title contain “red storm” and price < 35 and age > 12]

Page 8:

System Architecture

Input: source query Qs, target query form QI
Form Assistant (FA) components:
  Form Extractor
  Attribute Matcher: Syntax-based schema matching
  Predicate Mapper: Type-based search-driven mapping
  Query Rewriter: Constraint-based query rewriting
Domain knowledge: domain-specific thesaurus, domain-specific type handlers
Output: target query Qt*

Page 9:

Design Approaches

Query Modeling
  Vocabulary and Syntax
Query Translation
Dynamic Predicate Mapping

Page 10:

Query Modeling
Vocabulary

Predicate templates: { P1, P2, P3, P4, P5 }
Example: (figure: query form with its input fields labeled P1 through P5)

Page 11:

Query Modeling

Example Vocabulary (predicate templates)
  P1 = [author; contain; $au]
  P2 = [title; contain; $ti]
  P3 = [subject; contain; $su]
  P4 = [isbn; contain; $isbn]
  P5 = [price; between; $s, $e]

Example Syntax (valid conjunctive forms)
  F1 = P1 ∧ P5
  F2 = P2 ∧ P5
  F3 = P3 ∧ P5
  F4 = P4 ∧ P5
  F5 = P1
  F6 = P2
  F7 = P3
  F8 = P4
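The vocabulary and syntax on this slide can be sketched as plain data. This is only an illustration in Python, not the toolkit's actual representation: the names `PredicateTemplate`, `VALID_FORMS`, and `is_valid_form` are hypothetical and introduced here for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredicateTemplate:
    """One template [attribute; operator; placeholders] from the vocabulary."""
    attribute: str
    operator: str
    placeholders: tuple


# Vocabulary from the slide.
P1 = PredicateTemplate("author", "contain", ("$au",))
P2 = PredicateTemplate("title", "contain", ("$ti",))
P3 = PredicateTemplate("subject", "contain", ("$su",))
P4 = PredicateTemplate("isbn", "contain", ("$isbn",))
P5 = PredicateTemplate("price", "between", ("$s", "$e"))

# Syntax from the slide: each valid form is a conjunction of templates,
# stored here as the set of templates that may be filled in together.
VALID_FORMS = {
    "F1": frozenset({P1, P5}),
    "F2": frozenset({P2, P5}),
    "F3": frozenset({P3, P5}),
    "F4": frozenset({P4, P5}),
    "F5": frozenset({P1}),
    "F6": frozenset({P2}),
    "F7": frozenset({P3}),
    "F8": frozenset({P4}),
}


def is_valid_form(templates):
    """True if exactly this combination of templates is one of F1..F8."""
    return frozenset(templates) in VALID_FORMS.values()
```

Under this syntax, `is_valid_form([P1, P5])` holds (author with price), while `is_valid_form([P1, P2])` does not, because the example form never accepts author and title together.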

Page 12:

Query Modeling

Example Vocabulary Instantiations
  p1 = [author; contain; Tom Clancy]
  p2 = [title; contain; red storm]
  p5^1 = [price; between; 0-25]
  p5^2 = [price; between; 25-45]

Corresponding Form Queries
  f1 = p1 ∧ p5^1
  f2 = p1 ∧ p5^2

Resultant Union Query
  Qt = f1 ∨ f2
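To make the semantics of the union query concrete, the sketch below evaluates it over a tiny in-memory catalogue and then applies the source-side filter from the motivating example. It is a toy model under stated assumptions: the `BOOKS` records are made up, the helper names (`contain`, `between`, `conjunction`, `union`, `run`) are hypothetical, and the "age > 12" part of the filter is omitted.

```python
# Made-up records, for illustration only.
BOOKS = [
    {"author": "Tom Clancy", "title": "Red Storm Rising", "price": 12.0},
    {"author": "Tom Clancy", "title": "Executive Orders", "price": 40.0},
    {"author": "Tom Clancy", "title": "Collector's Edition", "price": 60.0},
    {"author": "Frank Herbert", "title": "Dune", "price": 10.0},
]


def contain(attribute, text):
    """Predicate [attribute; contain; text], case-insensitive."""
    return lambda rec: text.lower() in rec[attribute].lower()


def between(attribute, low, high):
    """Predicate [attribute; between; low, high], inclusive on both ends."""
    return lambda rec: low <= rec[attribute] <= high


def conjunction(*predicates):
    """A form query: every predicate must hold."""
    return lambda rec: all(p(rec) for p in predicates)


def union(*forms):
    """A union query: at least one form query must hold."""
    return lambda rec: any(f(rec) for f in forms)


p1 = contain("author", "Tom Clancy")
p5_1 = between("price", 0, 25)
p5_2 = between("price", 25, 45)
f1 = conjunction(p1, p5_1)
f2 = conjunction(p1, p5_2)
qt = union(f1, f2)

# Filter applied to the results of qt (title and price parts only).
source_filter = conjunction(contain("title", "red storm"),
                            lambda rec: rec["price"] < 35)


def run(query, records=BOOKS):
    """Return the titles of the records selected by the query."""
    return [rec["title"] for rec in records if query(rec)]
```

Here `run(qt)` selects the two Tom Clancy books priced within 0-45, and `run(conjunction(qt, source_filter))` narrows that to the single title matching "red storm" under 35.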

Page 13:

Query Modeling
Syntax

Valid combinations of predicate templates: { F1, F2, F3, F4, F5, F6, F7, F8 }

Example ('v' indicates 'valid'):

              F1  F2  F3  F4  F5  F6  F7  F8
P1 (author)   v               v
P2 (title)        v               v
P3 (subject)          v               v
P4 (isbn)                 v               v
P5 (price)    v   v   v   v

(Figure: forms F1 and F2 with “Tom Clancy” entered)

Page 14:

Query Translation

Based on semantic closeness of query predicates:
  Finds the minimal subsuming translation Cmin
Benefits of this approach:
  No false negatives (the translation subsumes the source query, so no answers are missed)
  Minimizes false positives (extra answers)
  Has clear semantics, independent of DB content
  Modular translation

Page 15:

Query Translation

Example (price ranges):
  s: 0-35
  t1: 0-25
  t2: 25-45
  t3: 45-65
  t1 ∨ t2: 0-45
  t1 ∨ t2 ∨ t3: 0-65
Which one is Cmin?
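For numeric ranges like these, the minimal subsuming choice can be sketched as "take every target range that overlaps the source range, then check that together they cover it". This is a simplification for illustration, not the paper's search-driven mapping: it assumes the target ranges are fixed and overlap each other at most at their endpoints, and the function name `minimal_subsuming_ranges` is hypothetical.

```python
def minimal_subsuming_ranges(source, targets):
    """Pick the target ranges whose union minimally covers the source range.

    source:  (low, high) range asked for on the source form
    targets: (low, high) ranges selectable on the target form, assumed to
             overlap each other at most at their endpoints
    Returns the chosen ranges in ascending order, or None if the target
    ranges cannot cover the source range.
    """
    s_low, s_high = source
    # A range is needed only if it overlaps the source on more than a point.
    chosen = [(low, high) for low, high in sorted(targets)
              if low < s_high and high > s_low]

    # Walk left to right and make sure no part of the source is left out.
    covered_up_to = s_low
    for low, high in chosen:
        if low > covered_up_to:
            return None  # gap between consecutive target ranges
        covered_up_to = max(covered_up_to, high)
    return chosen if covered_up_to >= s_high else None
```

With source range (0, 35) and target ranges (0, 25), (25, 45), (45, 65), the function keeps the first two ranges, matching t1 ∨ t2 in the example; (45, 65) is dropped because it adds only extra answers.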

Page 16:

Query Translation

Definition: Given source query Qs and target query form T, a query Qt* is a “minimal subsuming translation” w.r.t. T if:
  1. Qt* is a valid query w.r.t. T
  2. Qt* subsumes Qs
     i.e., for any database instance Di, Qs(Di) ⊆ Qt*(Di)
  3. Qt* is minimal
     i.e., there is no query Qt satisfying (1.) and (2.) above that Qt* strictly subsumes

Page 17:

Query Translation

Example: Consider source query Qs in the first example and three target queries Qt1, Qt2, Qt3
  Qt1 = (f1: p1 ∧ p5^1) ∨ (f2: p1 ∧ p5^2)
  Qt2 = f2
  Qt3 = f3: p1
  where
    p1 = [author; contain; Tom Clancy]
    p5^1 = [price; between; 0-25]
    p5^2 = [price; between; 25-45]

Qt1 and Qt3 subsume Qs, while Qt2 does not
  Qt2 misses price range 0-25, so it can't be the best translation Cmin
Prune Qt3 because it subsumes Qt1
That leaves Qt1 as Cmin
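The pruning argument on this slide can be replayed with a brute-force check. The sketch below is an illustration under assumptions, not the Form Assistant's rewriting algorithm: "any database instance" is approximated by a finite grid of (author, price) records, only the author and price parts of Qs are modeled, and all helper names are hypothetical.

```python
from itertools import product

# Finite stand-in for "any database instance": every (author, price)
# record on a half-unit price grid. This grid is an assumption.
AUTHORS = ["Tom Clancy", "Someone Else"]
PRICES = [step / 2 for step in range(0, 141)]  # 0.0, 0.5, ..., 70.0
UNIVERSE = [{"author": a, "price": p} for a, p in product(AUTHORS, PRICES)]


def author_contains(name):
    return lambda rec: name in rec["author"]


def price_between(low, high):
    return lambda rec: low <= rec["price"] <= high


def price_below(limit):
    return lambda rec: rec["price"] < limit


def all_of(*predicates):
    return lambda rec: all(p(rec) for p in predicates)


def any_of(*forms):
    return lambda rec: any(f(rec) for f in forms)


p1 = author_contains("Tom Clancy")
qs = all_of(p1, price_below(35))  # source query, author and price parts only
CANDIDATES = {
    "Qt1": any_of(all_of(p1, price_between(0, 25)),
                  all_of(p1, price_between(25, 45))),
    "Qt2": all_of(p1, price_between(25, 45)),
    "Qt3": p1,
}


def subsumes(big, small):
    """True if every record selected by `small` is also selected by `big`."""
    return all(big(rec) for rec in UNIVERSE if small(rec))


def minimal_subsuming(source, candidates):
    """Keep candidates that subsume the source, then prune any candidate
    that strictly subsumes another surviving candidate."""
    covering = {n: q for n, q in candidates.items() if subsumes(q, source)}
    return [
        name for name, query in covering.items()
        if not any(
            subsumes(query, other) and not subsumes(other, query)
            for other_name, other in covering.items() if other_name != name
        )
    ]
```

On this grid, Qt2 is rejected because it misses prices below 25, Qt3 is pruned because it strictly subsumes Qt1, and `minimal_subsuming(qs, CANDIDATES)` leaves only Qt1.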

Page 18:

Dynamic Predicate Mapping

Tasks:
  Choose operator
  Fill in values
Objective: minimal subsuming mapping between source and target

Page 19:

Dynamic Predicate Mapping

Example: (figure: Predicate Mapping example showing the input predicate and the output as a union (∪) of target predicates)

Page 20:

System Architecture (reminder)

Input: source query Qs, target query form QI
Form Assistant (FA) components:
  Form Extractor
  Attribute Matcher: Syntax-based schema matching
  Predicate Mapper: Type-based search-driven mapping
  Query Rewriter: Constraint-based query rewriting
Domain knowledge: domain-specific thesaurus, domain-specific type handlers
Output: target query Qt*

Page 21:

Implementation – Form Assistant Toolkit

Form Extractor
  Parses HTML into query predicate templates [attr; op; val]
  Details discussed in a different paper [3.] by the same research group
Attribute Matcher (1:1)
  Identifies semantically corresponding attributes between forms
  Customized with a domain thesaurus (indexes synonyms for commonly used concepts)
  Stems (e.g., “children” -> “child”) and removes stop words (e.g., “the”)
  Matched by value type and synonym attributes
Predicate Mapper (discussed in previous slides)
Query Rewriter
  Well-studied problem: finding the minimal subsuming query of a given predicate-mapped query (uses the approach of [5.] by Papakonstantinou et al.)
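The Attribute Matcher steps listed above (stop-word removal, stemming, thesaurus lookup, 1:1 matching) can be sketched as follows. This is a rough illustration only: the stop-word list, the naive stemmer, and the `THESAURUS` entries are all invented for the example, matching by value type is left out, and none of the function names come from the toolkit.

```python
# Hypothetical domain knowledge for a "books" domain.
STOP_WORDS = {"the", "of", "a", "an", "by", "for"}
IRREGULAR_STEMS = {"children": "child"}
THESAURUS = {
    "author": {"author", "writer"},
    "title": {"title"},
    "subject": {"subject", "keyword", "category"},
}


def stem(word):
    """Very naive stemmer: irregular table, then strip a plural 's'."""
    if word in IRREGULAR_STEMS:
        return IRREGULAR_STEMS[word]
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalize(label):
    """Lowercase a form label, drop stop words, and stem what is left."""
    words = [w for w in label.lower().split() if w not in STOP_WORDS]
    return frozenset(stem(w) for w in words)


def concept_of(label):
    """Return the thesaurus concept a label refers to, or None."""
    tokens = normalize(label)
    for concept, synonyms in THESAURUS.items():
        if tokens & synonyms:
            return concept
    return None


def match_attributes(source_labels, target_labels):
    """1:1 matching: pair labels that resolve to the same concept."""
    target_by_concept = {}
    for label in target_labels:
        concept = concept_of(label)
        if concept is not None:
            target_by_concept.setdefault(concept, label)
    matches = {}
    for label in source_labels:
        concept = concept_of(label)
        if concept in target_by_concept:
            matches[label] = target_by_concept[concept]
    return matches
```

For instance, a source label "Author" would pair with a target label "Writer" through the shared thesaurus concept, while a source label with no counterpart on the target form is simply left unmatched.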

Page 22:

Experiments

Datasets
  447 Deep Web sources (query forms) in 8 domains
  3 “Basic” domains, each with a custom thesaurus in FA
    Books, Airfares, Automobiles
  5 “New” domains (for tests, these don’t have a thesaurus)
    Car Rentals, Jobs, Hotels, Movies, Music/Records
Test Approach
  Run the FA to translate 120 form queries
    Each translation test corresponds to a random pairing of sources within a domain
  Count correct mappings in the translation suggested by FA
    Indicates the amount of user effort the Form Assistant has saved

Page 23:

Experiments

Results: Accuracy Distributions
  X: % correct predicate translations; Y: % tested query forms
  Forms with all 1:1 mappings had 87% perfect accuracy for the Basic dataset and 85% perfect for the New dataset (good domain flexibility)
  Forms having complex mappings: 76%, 70% “near perfect” (Y>80%)
    FA did not attempt complex (n:m) mappings, such as a full name in the source mapping to separate first and last names in the target
  (Figures: accuracy distribution for the Basic dataset and for the New dataset)

Page 24:

Experiments

Accuracy ratio: correct results per 1:1 query
  Raw: includes some forms whose input form extraction step had errors
  Perfect: manually forces all form extractions to be correct
Average accuracy improves when the extraction step is perfectly correct:
  For the Basic dataset (3 domains), 90.4% improves to 96.1%
  For the New dataset (5 domains), 81.1% improves to 86.7%

Page 25:

Experiments

Example Error in Form Extraction
  The delta.com form has a link to an alternative reservation page, “One-way & multi-city reservations”
  Wrongly interpreted by the Form Extractor as an input field label (attribute)

Page 26:

Experiments

Error Distribution: % of errors caused by each component
  Attribute Matching: 18%
  Form Extraction: 40%
  Predicate Mapping: 42%
Fewest errors are due to Attribute Matching
Most errors are due to Predicate Mapping
  Cited reason for PM errors is insufficient domain knowledge
  Example failure: source subject value “computer science” didn’t properly map to target subject value “programming languages”
  Improvement could entail a better domain-specific ontology and type handlers

Page 27:

Related Work

From the same research group:

Complex Matchings (n:m)
  Defines the “Type Recognizer” used in Form Assistant’s Attribute Matcher, and discusses complex n:m matchings not attempted by Form Assistant:
  [1.] Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach. B. He, K. C.-C. Chang, and J. Han. In Proceedings of the 2004 ACM SIGKDD Conference (KDD 2004) (Full Paper), Seattle, Washington, August 2004

MetaQuerier System
  Fuller system for both exploring (to find) and integrating (to query) Deep Web databases:
  [2.] Toward Large Scale Integration: Building a MetaQuerier over Databases on the Web. K. C.-C. Chang, B. He, and Z. Zhang. In Proceedings of the Second Conference on Innovative Data Systems Research (CIDR 2005), Asilomar, California, January 2005

Page 28:

Related Work

From the same research group:

Form Extraction
  As used by the implementation of Form Assistant:
  [3.] Understanding Web Query Interfaces: Best-Effort Parsing with Hidden Syntax. Z. Zhang, B. He, and K. C.-C. Chang. In Proceedings of the 2004 ACM SIGMOD Conference (SIGMOD 2004), Paris, France, June 2004

2007 thorough analysis of the Deep Web
  Interesting survey of web databases and query interfaces:
  [4.] Accessing the Deep Web: A Survey. B. He, M. Patel, Z. Zhang, and K. C.-C. Chang. Communications of the ACM (CACM), 50(5):94-101, May 2007

Public Datasets
  Cached real-world query form web pages (used in experiments): http://metaquerier.cs.uiuc.edu/repository/datasets/tel-8
  Additional Deep Web integration resources: http://metaquerier.cs.uiuc.edu/repository

Page 29:

Related Work

Query Rewriting
  As used by the implementation of Form Assistant:
  [5.] A Query Translation Scheme for Rapid Implementation of Wrappers. Y. Papakonstantinou, A. Gupta, H. Garcia-Molina, and J. Ullman. In Proceedings of the Fourth International Conference on Deductive and Object-Oriented Databases, Singapore, December 1995

Page 30:

Thank you!