An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web
Wensheng Wu (1), Clement Yu (2), AnHai Doan (1), Weiyi Meng (3)
(1) University of Illinois at Urbana-Champaign, (2) University of Illinois at Chicago, (3) SUNY at Binghamton
June 2004, Paris, France
2
Access Deep Web Sources
united.com airtravel.com
delta.com hotwire.com
3
Global Query Interface
united.com airtravel.com
delta.com hotwire.com
4
Constructing Global Query Interface
A unified query interface with these desired features:
Conciseness - Combine semantically similar fields over source interfaces
Completeness - Retain source-specific fields
User-friendliness - Highly related fields are close together
Two-phased integration:
Interface Matching - Identify semantically similar fields
Interface Integration - Merge the source query interfaces
5
Interface Matching – Challenges
Field A in one interface is semantically similar to field B in another interface, but the two have nothing in common. E.g., [figure not preserved]
sim(A,B) = sim(A,C): which field should A match? E.g., [figure not preserved]
6
Interface Matching – Challenges (Cont’d)
1:m mappings. E.g., [figure not preserved]
Determining the matching threshold
7
Existing Common Limitations
Limitation 1: Non-hierarchical modeling
Limitation 2: Do not handle 1:m mappings or handle them with low accuracy
Limitation 3: Do not allow limited user interaction
Detailed comparisons given in paper …
8
The IceQ Approach [SIGMOD-04]
Hierarchical modeling: Let’s be out of “flat” land
“Greedy” is good: Always start with the most confident matching
Bridging effect: “a2” and “c2” might not look similar themselves, but they might both be similar to “b3”
1:m mappings: Aggregate and is-a types
User interaction helps in:
- Interactive learning of matching threshold
- Resolution of uncertain mappings
[Figure labels: X, 0.5, 0.8, Pick this!]
9
Hierarchical Modeling
Source Query Interface
Ordered Tree Representation
Capture: ordering and grouping of fields
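A minimal sketch of how a query interface could be held as an ordered tree, so that both the grouping and the left-to-right order of fields survive. The class name `Node`, the helper `fields_in_order`, and the sample flight-search labels are illustrative assumptions, not taken from the slides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A group (internal node) or a field (leaf) of a query interface."""
    label: str
    children: List["Node"] = field(default_factory=list)

    def is_field(self) -> bool:
        # In this sketch a node without children is treated as a field.
        return not self.children


def fields_in_order(node: Node) -> List[str]:
    """Return the labels of all fields in their on-screen (depth-first) order."""
    if node.is_field():
        return [node.label]
    ordered = []
    for child in node.children:  # children are kept in interface order
        ordered.extend(fields_in_order(child))
    return ordered


# Hypothetical interface: two groups and one stand-alone field.
interface = Node("flight search", [
    Node("Where", [Node("From"), Node("To")]),
    Node("When", [Node("Depart"), Node("Return")]),
    Node("Passengers"),
])
```

Calling `fields_in_order(interface)` walks the tree left to right, which is the ordering information later slides rely on.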
10
Field Similarity Function
Each field may have a label, a name and a set of values, e.g., [figure not preserved]
Evaluate the similarity sim(A,B) between two fields, A and B, based on:
Linguistic similarity: label similarity, name similarity and name vs. label similarity, each measured by the Cosine function
Domain similarity: domain type similarity and domain value similarity
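The following is a rough sketch of a field similarity function along these lines, not the authors' exact formula. The equal weights, the averaging of the three linguistic scores, the exact-match test on domain types, and the Jaccard overlap used for domain values are all assumptions made for illustration; only the use of the cosine measure for the linguistic part comes from the slide.

```python
import math
from collections import Counter


def cosine(text_a: str, text_b: str) -> float:
    """Cosine similarity between the bag-of-words vectors of two strings."""
    a = Counter(text_a.lower().replace("_", " ").split())
    b = Counter(text_b.lower().replace("_", " ").split())
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def linguistic_sim(f: dict, g: dict) -> float:
    """Average of label, name, and name-vs-label similarity (averaging is an assumption)."""
    scores = [
        cosine(f["label"], g["label"]),
        cosine(f["name"], g["name"]),
        max(cosine(f["name"], g["label"]), cosine(f["label"], g["name"])),
    ]
    return sum(scores) / len(scores)


def domain_sim(f: dict, g: dict) -> float:
    """Mean of type agreement and Jaccard value overlap (both assumptions)."""
    type_sim = 1.0 if f["type"] == g["type"] else 0.0
    va, vb = set(f["values"]), set(g["values"])
    value_sim = len(va & vb) / len(va | vb) if (va | vb) else 0.0
    return (type_sim + value_sim) / 2


def field_sim(f: dict, g: dict, w_ling: float = 0.5, w_dom: float = 0.5) -> float:
    """Weighted combination; the weights are hypothetical."""
    return w_ling * linguistic_sim(f, g) + w_dom * domain_sim(f, g)


# Hypothetical fields from three interfaces.
dep_a = {"label": "Departure city", "name": "dep_city", "type": "string", "values": ["Chicago", "Paris"]}
dep_b = {"label": "Departure city", "name": "from", "type": "string", "values": ["Chicago", "Boston"]}
pax_c = {"label": "Number of passengers", "name": "pax", "type": "integer", "values": ["1", "2"]}
```

With these sample fields, `field_sim(dep_a, dep_b)` is clearly higher than `field_sim(dep_a, pax_c)`, since the latter pair shares no words, type, or values.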
11
Find 1:1 Mappings via Clustering
Interfaces: [figure not preserved]
Initial similarity matrix: [not preserved]
After one merge: [not preserved]
…, final clusters: {{a1,b1,c1}, {b2,c2}, {a2}, {b3}}
(Threshold = .3)
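As a hedged illustration of the clustering step, the sketch below greedily merges the most similar pair of clusters until no pair reaches the threshold. Several details are assumptions rather than facts from the slides: cluster similarity is taken as the maximum pairwise field similarity, a cluster may hold at most one field per interface, the interface of a field is read from the first character of its name, and the similarity values are made up so that the result matches the final clusters shown on the slide.

```python
def cluster_fields(fields, sim, threshold):
    """Greedy agglomerative clustering of fields into 1:1 mapping groups."""
    clusters = [frozenset([f]) for f in fields]

    def pair_sim(x, y):
        return sim.get((x, y), sim.get((y, x), 0.0))

    def cluster_sim(c1, c2):
        # Assumption: similarity of two clusters is their best pairwise similarity.
        return max(pair_sim(x, y) for x in c1 for y in c2)

    def compatible(c1, c2):
        # Assumption: a cluster may not contain two fields of the same interface.
        return not ({f[0] for f in c1} & {f[0] for f in c2})

    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if not compatible(clusters[i], clusters[j]):
                    continue
                s = cluster_sim(clusters[i], clusters[j])
                if s >= threshold and (best is None or s > best[0]):
                    best = (s, i, j)
        if best is None:  # no remaining pair reaches the threshold
            return clusters
        _, i, j = best
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]


# Made-up similarities; unlisted pairs default to 0.
fields = ["a1", "a2", "b1", "b2", "b3", "c1", "c2"]
sim = {("a1", "b1"): 0.9, ("b1", "c1"): 0.8, ("a1", "c1"): 0.7,
       ("b2", "c2"): 0.6, ("a2", "b3"): 0.2}
final_clusters = cluster_fields(fields, sim, threshold=0.3)
```

Because the most confident pair is always merged first, weaker and potentially wrong matches are only considered after the strong ones have constrained the clusters.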
12
“Bridging” Effect
[Figure: fields A, B and C]
Observations:
- It is difficult to match the “vehicle” field, A, with the “make” field, B
- But A’s instances are similar to C’s, and C’s label is similar to B’s
- Thus, C might serve as a “bridge” to connect A and B!
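The bridging effect can be sketched numerically. The similarity values and the threshold below are invented, and scoring a field against a cluster by its best pairwise similarity is an assumption; the point is only that A fails to reach B directly but reaches the cluster {B, C} through C.

```python
def cluster_similarity(cluster_a, cluster_b, sim):
    """Best pairwise similarity between two clusters (an assumed linkage rule)."""
    return max(sim.get(frozenset((x, y)), 0.0) for x in cluster_a for y in cluster_b)


# Hypothetical similarities: A ("vehicle") and B ("make") barely match,
# but C is similar to both.
sim = {
    frozenset(("A", "B")): 0.1,
    frozenset(("A", "C")): 0.6,
    frozenset(("B", "C")): 0.7,
}
threshold = 0.3

direct = cluster_similarity({"A"}, {"B"}, sim)        # A against B alone
bridged = cluster_similarity({"A"}, {"B", "C"}, sim)  # A against B after B and C merged
```

Here `direct` falls below the threshold while `bridged` clears it, so A ends up grouped with B once C has joined B's cluster.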
13
“Bridging” Effect (Cont’d)
Connections might also be made via labels
[Figure: interfaces from hotfares.com, airtickets.com and airtravel.com]
14
Field Ordering-based Tie Resolution
[Figure: fields A1, A2 and B1, B2; four similarity values, each 0.35]
Question: sim(A1, B1) = sim(A1, B2), which one should A1 match?
Observation: the ordering of fields conveys semantics!
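One simple way to act on this observation is to pair tied fields in the order they appear on their interfaces. This is only a sketch of that idea: the function name `resolve_ties_by_order` and the position numbers are hypothetical, and it assumes that B1 precedes B2 just as A1 precedes A2.

```python
def resolve_ties_by_order(tied_a, tied_b):
    """Pair tied fields by interface position.

    tied_a, tied_b: lists of (field_name, position) for the fields involved
    in the tie on each interface. Fields are matched first-with-first,
    second-with-second, and so on.
    """
    a_sorted = sorted(tied_a, key=lambda f: f[1])
    b_sorted = sorted(tied_b, key=lambda f: f[1])
    return [(a[0], b[0]) for a, b in zip(a_sorted, b_sorted)]


# All four similarities are 0.35, so similarity alone cannot decide.
matches = resolve_ties_by_order([("A2", 1), ("A1", 0)], [("B1", 0), ("B2", 1)])
```

With these positions the tie is broken in favour of A1 with B1 and A2 with B2.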
15
Complex Mappings
Aggregate type – the contents of the fields on the many side are parts of the content of the field on the one side
Commonalities – (1) field proximity, (2) parent label similarity, and (3) value characteristics
16
Complex Mappings (Cont’d)
Is-a type – the content of the field on the one side is the sum/union of the contents of the fields on the many side
Commonalities – (1) field proximity, (2) parent label similarity, and (3) value characteristics
17
Complex Mappings (Cont’d)
Final 1-m phase infers new mappings:
Preliminary 1-m phase: a1 <-> (b1, b2)
Clustering phase: b1 <-> c1, b2 <-> c2
Final 1-m phase: a1 <-> (c1, c2)
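The inference in the final 1-m phase can be sketched as composing the preliminary 1:m mappings with the 1:1 clusters. The function name `infer_one_to_many` and the convention that the first character of a field name identifies its interface are assumptions made for this example.

```python
def infer_one_to_many(prelim, clusters, interface_of):
    """Derive new 1:m mappings by following 1:1 clusters.

    prelim:   dict mapping a field to the tuple of fields it maps to (1:m).
    clusters: list of sets of fields matched 1:1 across interfaces.
    Returns a set of (one_field, many_fields) pairs not stated in prelim.
    """
    inferred = set()
    for one, many in prelim.items():
        counterparts = []  # per many-side field: interface -> matched field
        for m in many:
            cluster = next((c for c in clusters if m in c), {m})
            counterparts.append({interface_of(f): f for f in cluster if f != m})
        # Interfaces in which every many-side field has a counterpart.
        shared = set.intersection(*(set(cp) for cp in counterparts))
        shared.discard(interface_of(one))
        for iface in sorted(shared):
            inferred.add((one, tuple(cp[iface] for cp in counterparts)))
    return inferred


prelim = {"a1": ("b1", "b2")}             # preliminary 1-m phase
clusters = [{"b1", "c1"}, {"b2", "c2"}]   # clustering phase
new_mappings = infer_one_to_many(prelim, clusters, interface_of=lambda f: f[0])
```

For the slide's example this yields the mapping from a1 to (c1, c2).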
18
Active Learning of Thresholds
Observation: In an ideal situation,
if field A matches with some field X, then sim(A, X) > threshold T1
if field A does not match with any field, then max{sim(A, C)} over all fields C < T2, where T2 < T1
Similarity lists (each in descending order):
List 1: .91, .8, .73, .62, .46, .2, .03
List 2: .62, .53, .5, .48, .46, .32, .1
List 3: .87, .82, .6, .53, .5, .33, .28
Initial B: [0, .4]
Drop rule: 50%
List 1: (1) question on .2, answer yes, update B = [0, .2], continue on list 1; (2) question on .03, answer no, update B = [.03, .2]
List 2: question on .1, answer yes, update B = [.03, .1]
List 3: no values within B
Threshold set to any value between .03 and .1
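The walk-through above can be reproduced with a short sketch. It reflects one reading of the slide, not a confirmed algorithm: a value triggers a question when it lies strictly inside the current interval B and has dropped by at least the drop rule relative to the previous value in its list; a "yes" lowers the upper bound of B and a "no" raises the lower bound. The user is simulated by a hypothetical `oracle` function.

```python
def learn_threshold_bounds(lists, initial_bounds, drop_rule, oracle):
    """Narrow the interval B = (low, high) that must contain the threshold.

    lists:     similarity lists, each sorted in descending order.
    oracle(v): simulated user answer; True means "the pair with similarity v
               is a correct match".
    Returns the final bounds and the list of values the user was asked about.
    """
    low, high = initial_bounds
    asked = []
    for values in lists:
        for prev, cur in zip(values, values[1:]):
            dropped = (prev - cur) / prev >= drop_rule
            if dropped and low < cur < high:
                asked.append(cur)
                if oracle(cur):
                    high = cur   # a match at cur: the threshold is at most cur
                else:
                    low = cur    # not a match: the threshold is above cur
    return (low, high), asked


list1 = [.91, .8, .73, .62, .46, .2, .03]
list2 = [.62, .53, .5, .48, .46, .32, .1]
list3 = [.87, .82, .6, .53, .5, .33, .28]

# Simulated user whose answers agree with the slide (yes on .2 and .1, no on .03).
bounds, questions = learn_threshold_bounds(
    [list1, list2, list3], (0.0, 0.4), 0.5, oracle=lambda v: v >= 0.05)
```

Only three questions are needed for these lists, and list 3 triggers none because none of its values fall inside B.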
19
Interactive Resolution of Uncertain Mappings
Resolve potential homonyms
Observation: two fields are possible homonyms if their labels are highly similar while their domains are not.
Determine potential synonyms
Observation: two fields might still be similar if there are common values in their domains, even if their label/domain similarities are low
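A small sketch of how these two observations could be turned into checks that decide when to ask the user. The function names and every cutoff value (`high`, `low`, `threshold`) are hypothetical; the slides do not give concrete numbers.

```python
def is_potential_homonym(label_sim, domain_sim, high=0.8, low=0.2):
    """Labels look alike but domains do not: ask the user before matching."""
    return label_sim >= high and domain_sim <= low


def is_potential_synonym(values_a, values_b, field_sim, threshold):
    """Overall similarity is below the matching threshold, yet the two
    domains share at least one value: ask the user before rejecting."""
    return field_sim < threshold and bool(set(values_a) & set(values_b))


# Hypothetical cases.
homonym_case = is_potential_homonym(label_sim=0.9, domain_sim=0.1)
synonym_case = is_potential_synonym(["economy", "business"], ["business", "first"],
                                    field_sim=0.2, threshold=0.3)
```

Both checks only flag candidates; in the approach described, the final decision is left to the user.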
20
Interactive Resolution of Uncertain Mappings
Determine potential 1:m mappings
Observation: A might still match with B and C if (a) sim(A,B) is very close to sim(A,C); (b) B and C are adjacent; and (c) A is the only field in its interface which satisfies (a) and (b)
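Conditions (a) to (c) can be sketched as a filter over similarity rows. The tolerance `eps` for "very close", the floor `min_sim` that keeps two near-zero similarities from counting as close, and the data layout are assumptions introduced for this example.

```python
def close_adjacent_pairs(sims, eps=0.05, min_sim=0.1):
    """Index pairs (i, i+1) of adjacent fields whose similarities to the
    same source field are within eps of each other (conditions a and b)."""
    return [(i, i + 1) for i in range(len(sims) - 1)
            if abs(sims[i] - sims[i + 1]) <= eps and min(sims[i], sims[i + 1]) > min_sim]


def potential_one_to_many(sim_rows, eps=0.05, min_sim=0.1):
    """sim_rows maps each field of one interface to its similarities with the
    fields of another interface, listed in that interface's field order.
    Returns {field: (i, j)} for pairs claimed by exactly one field (condition c)."""
    owners = {}
    for name, sims in sim_rows.items():
        for pair in close_adjacent_pairs(sims, eps, min_sim):
            owners.setdefault(pair, []).append(name)
    return {names[0]: pair for pair, names in owners.items() if len(names) == 1}


# Hypothetical rows: A is about equally similar to the first two fields.
rows = {"A": [0.42, 0.40, 0.05], "D": [0.10, 0.60, 0.30]}
candidates = potential_one_to_many(rows)
```

Each returned candidate would then be shown to the user as a possible 1:m mapping to confirm or reject.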
21
Empirical Evaluations
Automatic field matching
Accuracy with learned thresholds
Distribution of questions
Accuracy with all user interactions
22
Comparison of Component Contributions
On average, 12.6% increase in recall
[Chart annotations: 15.4%, 7.3%]
23
Summary
High accuracy of determining matching fields across multiple user interfaces
Limited use of user interactions
24
Future Research
Improve the accuracy of determining matching fields further
Decrease the number of user interactions
Produce a unified, user-friendly interface
Provide such a tool on the Web