An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web

Wensheng Wu (1), Clement Yu (2), AnHai Doan (1), Weiyi Meng (3)
(1) University of Illinois at Urbana-Champaign, (2) University of Illinois at Chicago, (3) SUNY at Binghamton
June 2004, Paris, France


Page 1: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web

Wensheng Wu (1), Clement Yu (2), AnHai Doan (1), Weiyi Meng (3)

(1) University of Illinois at Urbana-Champaign
(2) University of Illinois at Chicago
(3) SUNY at Binghamton

June 2004, Paris, France

An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web

Page 2: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web

Access Deep Web Sources

united.com, airtravel.com, delta.com, hotwire.com

Page 3: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web

Global Query Interface

united.com, airtravel.com, delta.com, hotwire.com

Page 4: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web

Constructing Global Query Interface

A unified query interface with these desired features:
- Conciseness: combine semantically similar fields over source interfaces
- Completeness: retain source-specific fields
- User-friendliness: highly related fields are close together

Two-phase integration:
- Interface Matching: identify semantically similar fields
- Interface Integration: merge the source query interfaces

Page 5: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web

Interface Matching – Challenges

- Field A in one interface is semantically similar to field B in another interface, but the two have nothing in common. E.g., [example figure not preserved]
- sim(A,B) = sim(A,C): which field should A match? E.g., [example figure not preserved]

Page 6: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web

Interface Matching – Challenges (Cont’d)

- 1:m mappings. E.g., [example figure not preserved]
- Determine matching threshold

Page 7: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web

Common Limitations of Existing Approaches

Limitation 1: Non-hierarchical modeling

Limitation 2: Do not handle 1:m mappings, or handle them with low accuracy

Limitation 3: Do not allow limited user interactions

Detailed comparisons are given in the paper …

Page 8: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web

The IceQ Approach [SIGMOD-04]

- Hierarchical modeling: let’s get out of “flat” land
- “Greedy” is good: always start with the most confident matching
- Bridging effect: “a2” and “c2” might not look similar themselves, but they might both be similar to “b3”
- 1:m mappings: aggregate and is-a types
- User interaction helps in:
  - Interactive learning of the matching threshold
  - Resolution of uncertain mappings

[Figure annotations: similarity values 0.5 and 0.8; “Pick this!”]

Page 9: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web

Hierarchical Modeling

[Figure: a source query interface and its ordered tree representation]

Captures: ordering and grouping of fields
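A minimal sketch of what an ordered tree for a query interface might look like. The `Node` class and the airfare interface below are hypothetical illustrations, not the data structure or example from the slides; the only point is that child order records field ordering and internal nodes record grouping.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A node of an ordered tree: leaves are fields, internal nodes are groups."""
    label: str
    children: List["Node"] = field(default_factory=list)  # order is significant

    def fields(self) -> List[str]:
        """Return the leaf fields in left-to-right (on-screen) order."""
        if not self.children:
            return [self.label]
        return [f for child in self.children for f in child.fields()]


# Hypothetical airfare interface: two groups, each with ordered fields.
interface = Node("airfare search", [
    Node("Where do you want to go?", [Node("From"), Node("To")]),
    Node("When do you want to travel?", [Node("Departure date"), Node("Return date")]),
])

print(interface.fields())
```

A flat model would keep only the list printed here; the tree also keeps the group labels, which can serve later as parent labels when comparing fields.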

Page 10: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web

Field Similarity Function

Each field may have a label, a name and a set of values, e.g., [example not preserved]

Evaluate the similarity sim(A,B) between two fields, A and B, based on:
- Linguistic similarity: label similarity, name similarity and name vs. label similarity, each measured by the Cosine function
- Domain similarity: domain type similarity and domain value similarity
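The following sketch shows one way the two components could be combined. The slide names the ingredients (cosine over labels and names, domain type and value similarity) but not how they are weighted, so the equal-weight averages, the Jaccard measure for value sets, and the dictionary layout of a field are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def cosine(text_a, text_b):
    """Cosine similarity between the bag-of-words vectors of two strings."""
    va = Counter(re.findall(r"[a-z0-9]+", text_a.lower()))
    vb = Counter(re.findall(r"[a-z0-9]+", text_b.lower()))
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0


def linguistic_sim(f, g):
    """Label, name, and name-vs-label similarity (assumed: plain average)."""
    parts = [cosine(f["label"], g["label"]), cosine(f["name"], g["name"]),
             cosine(f["name"], g["label"]), cosine(f["label"], g["name"])]
    return sum(parts) / len(parts)


def domain_sim(f, g):
    """Domain type match plus value overlap (assumed: Jaccard, plain average)."""
    type_sim = 1.0 if f["type"] == g["type"] else 0.0
    union = f["values"] | g["values"]
    value_sim = len(f["values"] & g["values"]) / len(union) if union else 0.0
    return (type_sim + value_sim) / 2


def field_sim(f, g, w_ling=0.5, w_dom=0.5):
    """sim(A, B) as a weighted sum; the weights are hypothetical."""
    return w_ling * linguistic_sim(f, g) + w_dom * domain_sim(f, g)


# Hypothetical fields from three interfaces.
a = {"label": "Departure city", "name": "dep_city", "type": "text", "values": set()}
b = {"label": "Leaving from city", "name": "from_city", "type": "text", "values": set()}
c = {"label": "Class", "name": "cabin", "type": "enum", "values": {"economy", "business"}}

print(field_sim(a, b), field_sim(a, c))
```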

Page 11: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web

Find 1:1 Mappings via Clustering

Interfaces: [figure not preserved]

Initial similarity matrix: [figure not preserved]

After one merge: [figure not preserved]

…, final clusters: {{a1,b1,c1}, {b2,c2}, {a2}, {b3}}

(Threshold = .3)
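A sketch of greedy clustering with a threshold, under stated assumptions. The similarity matrix from the slide is not preserved, so the values below are invented and chosen only so that the run ends in the final clusters shown on the slide. Two further assumptions: cluster similarity is taken as the maximum pairwise field similarity, and two clusters holding fields of the same interface are never merged (the interface is read from the first letter of the field name).

```python
from itertools import combinations


def cluster_sim(c1, c2, sim):
    """Assumed cluster similarity: best field pair, 0 if an interface repeats."""
    if {f[0] for f in c1} & {f[0] for f in c2}:
        return 0.0
    return max(sim.get(frozenset((x, y)), 0.0) for x in c1 for y in c2)


def greedy_cluster(fields, sim, threshold):
    """Repeatedly merge the most similar pair of clusters above the threshold."""
    clusters = [frozenset([f]) for f in fields]
    while True:
        best, pair = threshold, None
        for c1, c2 in combinations(clusters, 2):
            s = cluster_sim(c1, c2, sim)
            if s > best:
                best, pair = s, (c1, c2)
        if pair is None:  # nothing left above the threshold
            return clusters
        clusters = [c for c in clusters if c not in pair] + [pair[0] | pair[1]]


# Hypothetical similarities (not the matrix from the slide).
sim = {
    frozenset(("a1", "b1")): 0.9, frozenset(("b1", "c1")): 0.8,
    frozenset(("a1", "c1")): 0.6, frozenset(("b2", "c2")): 0.5,
    frozenset(("a2", "b3")): 0.2, frozenset(("a2", "c2")): 0.1,
}
fields = ["a1", "a2", "b1", "b2", "b3", "c1", "c2"]
result = greedy_cluster(fields, sim, threshold=0.3)
print(sorted(sorted(c) for c in result))
```

Starting from the most confident pair (a1, b1) is the "greedy" choice; a2 and b3 stay alone because their best similarity, 0.2, is below the threshold.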

Page 12: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web

“Bridging” Effect

[Figure: three fields A, B and C; the match between A and B is in question]

Observations:
- It is difficult to match the “vehicle” field, A, with the “make” field, B
- But A’s instances are similar to C’s, and C’s label is similar to B’s
- Thus, C might serve as a “bridge” to connect A and B!

Page 13: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web

“Bridging” Effect (Cont’d)

[Figure: fields from hotfares.com, airtickets.com and airtravel.com]

Connections might also be made via labels
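The bridging effect can be illustrated with a small sketch: A and B are never compared favorably to each other, yet they end up in one group because each is similar to C. The similarity values and the union-find grouping are assumptions for illustration only, not the procedure from the talk.

```python
def bridged_groups(pair_sims, threshold):
    """Group fields transitively: any pair above the threshold joins two groups."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Most confident pairs first, as in a greedy merge order.
    for (x, y), s in sorted(pair_sims.items(), key=lambda kv: -kv[1]):
        rx, ry = find(x), find(y)
        if s > threshold and rx != ry:
            parent[rx] = ry

    groups = {}
    for f in list(parent):
        groups.setdefault(find(f), set()).add(f)
    return list(groups.values())


# Hypothetical values: A and B look dissimilar, but both are similar to C.
pair_sims = {("A", "B"): 0.05, ("A", "C"): 0.6, ("C", "B"): 0.7}
print(bridged_groups(pair_sims, threshold=0.3))
```

Thresholding the A-B pair on its own would reject it (0.05 is below 0.3); the match appears only once C has been merged with both.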

Page 14: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web

Field Ordering-based Tie Resolution

Question: sim(A1, B1) = sim(A1, B2), which one should A1 match?

Observation: the ordering of fields conveys semantics!

[Figure: fields A1, A2, B1 and B2; the four similarity values shown are all 0.35]
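One possible reading of ordering-based tie resolution, sketched below: when several candidates tie, prefer the one whose position in its interface is closest to the position of the field being matched. The slide does not spell out the exact rule, so this position-distance criterion and the helper name `resolve_tie` are assumptions.

```python
def resolve_tie(field_name, tied_candidates, pos_a, pos_b):
    """Pick the tied candidate whose position is closest to the field's position."""
    return min(tied_candidates, key=lambda c: abs(pos_a[field_name] - pos_b[c]))


# Positions of the fields within their own interfaces (0 = first).
pos_a = {"A1": 0, "A2": 1}
pos_b = {"B1": 0, "B2": 1}

# All four similarities are 0.35, so similarity alone cannot decide.
print(resolve_tie("A1", ["B1", "B2"], pos_a, pos_b))
print(resolve_tie("A2", ["B1", "B2"], pos_a, pos_b))
```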

Page 15: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web

Complex Mappings

Aggregate type – contents of the fields on the many side are parts of the content of the field on the one side

Commonalities – (1) field proximity, (2) parent label similarity, and (3) value characteristics

Page 16: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web

Complex Mappings (Cont’d)

Is-a type – the content of the field on the one side is the sum/union of the contents of the fields on the many side

Commonalities – (1) field proximity, (2) parent label similarity, and (3) value characteristics

Page 17: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web

Complex Mappings (Cont’d)

Final 1:m phase infers new mappings:
- Preliminary 1:m phase: a1 ↔ (b1, b2)
- Clustering phase: b1 ↔ c1, b2 ↔ c2
- Final 1:m phase: a1 ↔ (c1, c2)
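The inference in the final 1:m phase is a composition of mappings, which can be sketched as follows. The function name and the dictionary representation are hypothetical; the field names a1, b1, b2, c1, c2 are the ones used on the slide.

```python
def infer_final_mappings(preliminary, one_to_one):
    """Compose preliminary 1:m mappings with 1:1 matches found by clustering.

    preliminary: {one_side_field: (many_side_fields, ...)}
    one_to_one:  {field: matching field in a third interface}
    """
    inferred = {}
    for one, many in preliminary.items():
        # Only infer when every many-side field has a 1:1 counterpart.
        if all(m in one_to_one for m in many):
            inferred[one] = tuple(one_to_one[m] for m in many)
    return inferred


preliminary = {"a1": ("b1", "b2")}      # preliminary 1:m phase
one_to_one = {"b1": "c1", "b2": "c2"}   # clustering phase
print(infer_final_mappings(preliminary, one_to_one))
```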

Page 18: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web

Active Learning of Thresholds

Observation: In an ideal situation,
- if field A matches with some field X, then sim(A, X) > threshold T1
- if field A does not match with any field, then max{sim(A, C)} over all fields C is < T2, where T2 < T1

Similarity lists (each sorted in descending order):
List 1: .91, .8, .73, .62, .46, .2, .03
List 2: .62, .53, .5, .48, .46, .32, .1
List 3: .87, .82, .6, .53, .5, .33, .28

Initial B: [0, .4]

Drop rule: 50%

List 1: (1) question on .2, answer yes, update B = [0, .2], continue on list 1; (2) question on .03, answer no, update B = [.03, .2]
List 2: question on .1, answer yes, update B = [.03, .1]
List 3: no values within B

Threshold set to any value between .03 and .1
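The walk-through on this slide can be reproduced with a short sketch. Two points are interpretations rather than statements from the slide: the 50% drop rule is read as "ask about a value that is at most half of the value before it in the list", and a "yes" answer is read as "this pair is a match", which lowers the upper end of B, while a "no" raises the lower end and ends the questions for that list. The three lists and the answers are taken from the slide.

```python
def learn_threshold_bounds(lists, bounds, drop, is_match):
    """Narrow the threshold interval B by asking about large similarity drops."""
    lo, hi = bounds
    asked = []
    for sims in lists:
        for prev, cur in zip(sims, sims[1:]):
            inside = lo < cur < hi               # value lies within B
            big_drop = cur <= prev * (1 - drop)  # assumed reading of the drop rule
            if not (inside and big_drop):
                continue
            asked.append(cur)
            if is_match(cur):
                hi = cur   # a match at cur: the threshold lies below it
            else:
                lo = cur   # a non-match at cur: the threshold lies at or above it
                break      # lower values in this list need no question
    return (lo, hi), asked


lists = [
    [.91, .8, .73, .62, .46, .2, .03],   # List 1
    [.62, .53, .5, .48, .46, .32, .1],   # List 2
    [.87, .82, .6, .53, .5, .33, .28],   # List 3
]
answers = {.2: True, .03: False, .1: True}  # user answers from the slide
bounds, asked = learn_threshold_bounds(lists, (0, .4), .5, answers.__getitem__)
print(bounds, asked)
```

Under these assumptions the sketch asks the same three questions as the slide and ends with the same interval, so any threshold between .03 and .1 is consistent with the answers.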

Page 19: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web

Interactive Resolution of Uncertain Mappings

Resolve potential homonyms
- Observation: two fields are possible homonyms if their labels are highly similar while their domains are not.

Determine potential synonyms
- Observation: two fields might still be similar if there are common values in their domains, even if their label/domain similarities are low.
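The two observations can be turned into a simple flagging rule that decides when to ask the user. The cut-offs `high` and `low` below are hypothetical values chosen for illustration; the slide does not say what counts as "highly similar" or "low".

```python
def flag_uncertain(label_sim, domain_sim, common_values, high=0.8, low=0.2):
    """Return which kind of question, if any, to put to the user."""
    if label_sim >= high and domain_sim <= low:
        # Labels agree but domains do not: the fields may be homonyms.
        return "possible homonym"
    if common_values and label_sim <= low and domain_sim <= low:
        # Low similarities, yet some domain values are shared.
        return "possible synonym"
    return None


print(flag_uncertain(0.9, 0.1, set()))
print(flag_uncertain(0.1, 0.15, {"economy"}))
print(flag_uncertain(0.5, 0.5, set()))
```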

Page 20: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web

Interactive Resolution of Uncertain Mappings (Cont’d)

Determine potential 1:m mappings
- Observation: A might still match with B and C if
  (a) sim(A,B) is very close to sim(A,C);
  (b) B and C are adjacent; and
  (c) A is the only field in its interface which satisfies (a) and (b)
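Conditions (a) to (c) can be checked mechanically, as in the sketch below. The tolerance `eps` for "very close", the requirement that both similarities be non-zero, and the field names in the example are assumptions; the slide gives only the three conditions.

```python
def potential_one_to_many(sims, order_b, eps=0.05):
    """Find (A, (B, C)) candidates satisfying conditions (a), (b) and (c).

    sims:    {(field_in_interface_1, field_in_interface_2): similarity}
    order_b: fields of interface 2 in their on-screen order
    """
    a_fields = sorted({a for a, _ in sims})
    candidates = []
    for b, c in zip(order_b, order_b[1:]):          # (b) B and C are adjacent
        hits = []
        for a in a_fields:
            sb, sc = sims.get((a, b), 0.0), sims.get((a, c), 0.0)
            if min(sb, sc) > 0 and abs(sb - sc) <= eps:  # (a) nearly equal
                hits.append(a)
        if len(hits) == 1:                          # (c) A is the only such field
            candidates.append((hits[0], (b, c)))
    return candidates


# Hypothetical similarities between two interfaces.
sims = {
    ("passengers", "adults"): 0.40, ("passengers", "children"): 0.42,
    ("class", "adults"): 0.10, ("class", "children"): 0.30,
}
print(potential_one_to_many(sims, ["adults", "children"]))
```

Each candidate returned this way would be put to the user as a question rather than accepted automatically.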

Page 21: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web


Empirical Evaluations

Automatic field matching

Accuracy with learned thresholds

Distribution of questions

Accuracy with all user interactions

Page 22: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web

Comparison of Component Contributions

On average, 12.6% increase in recall

[Figure annotations: 15.4%, 7.3%]

Page 23: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web

Summary

High accuracy in determining matching fields across multiple user interfaces

Limited use of user interactions

Page 24: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web

Future Research

Further improve the accuracy of determining matching fields

Decrease the number of user interactions

Produce a unified, user-friendly interface

Provide such a tool on the Web