
22nd International Conference on Data Engineering (ICDE'06), Atlanta, GA, USA, April 3-7, 2006

Holistic Query Interface Matching using Parallel Schema Matching

Weifeng Su
Uni. of Sci. & Tech.
Kowloon, Hong Kong
[email protected]

Jiying Wang
City University
Kowloon, Hong Kong
[email protected]

Frederick Lochovsky
Uni. of Sci. & Tech.
Kowloon, Hong Kong
[email protected]

Abstract

Using query interfaces of different Web databases, we propose a new complex schema matching approach, Parallel Schema Matching (PSM). A parallel schema is formed by comparing two individual schemas and deleting their common attributes. Attribute matchings can be discovered from the attribute-occurrence patterns if many parallel schemas are available. A count-based greedy algorithm identifies which attributes are more likely to be matched. Experiments show that PSM can identify both simple and complex matchings accurately and efficiently.

1 INTRODUCTION

With thousands of Web databases available today, it is very time consuming for an ordinary user to query and retrieve information from all the relevant databases. Since most Web databases are only accessible through a query interface, a system or tool that helps a user locate information in numerous Web databases must be able to understand the query interfaces and help dispatch user queries to suitable fields of those interfaces. The main challenge for such a system is that different databases may use different fields or terms to represent the same concept. For example, to describe the genre of a CD in the MusicRecords domain, Category is used in some databases while Style is used in others. In the Books domain, First Name and Last Name are used in some databases while Author is used in others to denote the writer of a book.

This paper specifically focuses on matching attributes across query interfaces of structured Web databases. We define an entry or field in a query interface as an attribute, and all attributes in the query interface as a schema of the interface. When matching the attributes, we call a 1:1 matching, such as Category with Style, a simple matching and a 1:n or m:n matching, such as First Name, Last Name with Author, a complex matching. In the latter case, attributes First Name and Last Name form a concept group before they are matched to attribute Author. We call attributes that are in the same concept group grouping attributes and attributes that are semantically identical or similar to each other synonym attributes.

We propose a new interface schema matching approach, Parallel Schema Matching (PSM), that matches all input schemas holistically, instead of matching two schemas at a time, in order to take advantage of the occurrence patterns of the attributes. To better discover the frequent or rare co-presence of attributes, we form parallel schemas by comparing two schemas and deleting their common attributes. Both time complexity analysis and experimental results show that the PSM approach can efficiently discover simple and complex matchings at the same time with high accuracy while requiring no domain knowledge or user interaction.

2 RELATED WORK

Current solutions for the schema matching problem [1, 3, 4, 6, 7, 8, 9, 10, 12] suffer from the following limitations:

1. simple matching: most schema matching methods can only discover simple matchings between schemas.

2. low accuracy: the accuracy of methods that can identify complex matchings is generally unsatisfactory.

3. time consuming: some schema matching methods have time complexity exponential in the number of attributes.

4. domain knowledge required: some schema matching methods require domain knowledge or user interaction before or during the matching process.

DCM [5], which discovers complex matchings holistically using data mining techniques, is based on observations similar to those of PSM, but has the following disadvantages compared to PSM:

1. DCM uses $H = \frac{f_{01} f_{10}}{f_{+1} f_{1+}}$ to measure the negative correlation between two attributes, by which the synonym attributes are discovered. Such a measure may give a high score for rare attributes, while PSM's matching score measure will not.

2. The time complexity of DCM is exponential with respect to the number of attributes n, while PSM's time complexity is polynomial with respect to n.

Proceedings of the 22nd International Conference on Data Engineering (ICDE'06) 8-7695-2570-9/06 $20.00 © 2006 IEEE

3 PARALLEL SCHEMA MATCHING

We observe that Web databases in the same domain usually share the following characteristics:

1. They are usually semantically similar to each other.

2. An attribute is usually unambiguous in a domain although it may have more than one meaning in an ordinary comprehensive dictionary.

3. Synonym attributes are rarely co-present in the same interface schema.

4. Grouping attributes are usually co-present in the same interface schema to form a "larger" concept.

We formalize the schema matching problem in the same way as in [5]. The input is a set of schemas $S = \{S_1, \ldots, S_u\}$, in which each schema $S_i$ ($1 \le i \le u$) contains a set of attributes extracted from a query interface, and the set of attributes $A = \bigcup_{i=1}^{u} S_i = \{A_1, \ldots, A_n\}$ includes all attributes in $S$. We assume that these schemas come from the same domain. The schema matching problem is to find all matchings $M = \{M_1, \ldots, M_v\}$, including both simple and complex matchings. A matching $M_j$ ($1 \le j \le v$) is represented as $G_{j1} = G_{j2} = \ldots = G_{jw}$, where $G_{jk}$ ($1 \le k \le w$) is a group of attributes and $G_{jk}$ is a subset of $A$, i.e., $G_{jk} \subset A$. Each matching $M_j$ should represent the semantic synonym relationship between two attribute groups $G_{jk}$ and $G_{jl}$ ($l \ne k$), and each group $G_{jk}$ should represent the grouping relationship between the attributes within it (footnote 1). More specifically, based on observations 2 and 4 above, we restrict each attribute to appear no more than once in $M$.
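The formalization above can be illustrated with a minimal sketch. The toy attribute names, the variable names, and the helper `is_valid` are assumptions made for illustration and do not come from the paper; schemas are modeled as Python sets and each matching as a list of attribute groups.

```python
from itertools import chain

# Hypothetical toy input: each schema is the set of attribute names of one query interface.
schemas = [
    {"title", "author", "category"},
    {"title", "first name", "last name", "style"},
    {"title", "author", "style"},
]

# A is the union of all schemas S_1, ..., S_u.
attributes = set().union(*schemas)

# Each matching M_j is a list of attribute groups G_j1 = G_j2 = ...
matchings = [
    [frozenset({"author"}), frozenset({"first name", "last name"})],  # complex matching
    [frozenset({"category"}), frozenset({"style"})],                  # simple matching
]


def is_valid(matchings, attributes):
    """Check that every group is a non-empty subset of A and that
    no attribute appears more than once across all matchings."""
    seen = set()
    for group in chain.from_iterable(matchings):
        if not group or not group <= attributes:
            return False
        if seen & group:
            return False
        seen |= group
    return True
```

The check on `seen` encodes the restriction that each attribute appears no more than once in $M$.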

The workflow of PSM is shown in Figure 1. Before the matching discovery, two scores are calculated between every two attributes: the matching score, which evaluates the possibility that two attributes are synonym attributes, and the grouping score, which evaluates the possibility that two attributes are in the same group in a matching.

Synonym Attribute Candidate Generation takes all schemas as input and generates all synonym attribute candidates based on observation 3. If there are $n$ attributes in the input schemas, the maximum number of synonym attribute candidates is $C_n^2 = \frac{n(n-1)}{2}$. However, not every pair of attributes from $A$ can be an actual synonym attribute candidate. We assume that two attributes $(A_p, A_q)$ are synonym attribute candidates if $A_p$ and $A_q$ are co-present in fewer than $T_{pq}$ schemas, where $T_{pq}$ is defined as:

$$T_{pq} = \frac{(C_p + C_q) \ln u}{u}, \qquad (1)$$

and $C_p$ and $C_q$ are the counts of attributes $A_p$ and $A_q$ in $S$.

Footnote 1: An attribute group can have just one attribute.

Figure 1: Parallel Schema Matching Workflow.

Parallel Schema Construction takes every pair of input schemas and constructs a set of parallel schemas in three steps. First, every two different schemas are paired to form preliminary parallel schemas. If there are $u$ input schemas, $C_u^2 = \frac{u(u-1)}{2}$ preliminary parallel schemas will be obtained. Second, in every preliminary parallel schema, common attributes that are shared by its two attribute sets are deleted based on observation 2. Finally, parallel schemas that contain at least one empty attribute set are discarded.
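The two steps above can be sketched as follows. This is a minimal illustration, assuming schemas are given as Python sets of attribute names; the function names and the toy data are hypothetical and not taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def synonym_candidates(schemas):
    """Return attribute pairs that are co-present in fewer than T_pq schemas (Eq. 1)."""
    u = len(schemas)
    count = Counter(a for s in schemas for a in s)            # C_p
    co_count = Counter(                                       # co-presence count of each pair
        pair for s in schemas for pair in combinations(sorted(s), 2)
    )
    candidates = set()
    for p, q in combinations(sorted(count), 2):
        threshold = (count[p] + count[q]) * math.log(u) / u   # T_pq
        if co_count[(p, q)] < threshold:
            candidates.add((p, q))
    return candidates


def parallel_schemas(schemas):
    """Pair every two schemas, delete their common attributes,
    and discard pairs in which either side becomes empty."""
    result = []
    for s1, s2 in combinations(schemas, 2):
        common = s1 & s2
        left, right = s1 - common, s2 - common
        if left and right:
            result.append((left, right))
    return result


# Hypothetical toy input.
schemas = [
    {"title", "author", "category"},
    {"title", "first name", "last name", "style"},
    {"title", "author", "style"},
]
```

On this toy input, `("category", "style")` is a candidate because the two attributes never co-occur, while `("first name", "last name")` is not, since their single co-occurrence already exceeds the threshold for such rare attributes.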

After deleting the common attributes, the relationship between attribute $A_p$'s count in the input schema set $S$ and $A_p$'s count in the parallel schema set $Q$ is

$$D_p = C_p (u - C_p), \qquad (2)$$

where $C_p$ and $D_p$ are the counts of $A_p$ in $S$ and $Q$, respectively.
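A short counting argument for Eq. (2) is sketched below. It assumes that the count is taken after the common attributes are deleted but before parallel schemas with an empty attribute set are discarded; the discarding step can only lower the count.

```latex
% A_p survives in the pair (S_i, S_j) exactly when one of the two schemas
% contains A_p and the other does not; otherwise it is absent or deleted as common.
\begin{align*}
D_p &= \bigl|\{(S_i, S_j) : i < j,\ A_p \in S_i \,\triangle\, S_j\}\bigr| \\
    &= \underbrace{C_p}_{\text{schemas with } A_p} \times
       \underbrace{(u - C_p)}_{\text{schemas without } A_p}.
\end{align*}
```

For example, with $u = 10$ schemas and $C_p = 4$, the attribute remains in $4 \times 6 = 24$ preliminary parallel schemas.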

Matching Score Calculation calculates matching scores based on the co-presence count of the attributes in the parallel schemas.

Definition 1 Given a parallel schema $Q_l = (S_{l1}, S_{l2})$ and two attributes $A_p \in Q_l$ and $A_q \in Q_l$, if $A_p \in S_{l1}$ and $A_q \in S_{l2}$ or vice versa, $A_p$ and $A_q$ are defined as being cross co-present in $Q_l$, denoted as $(A_p, A_q) \mathrel{\dot{\in}} Q_l$.

The cross co-presence count of two attributes in the parallel schema set is used to calculate the matching score between them. During the calculation, we assume that each parallel schema provides the same amount of information about matching, thus each cross co-present attribute pair in it will gain equal cross co-presence weight.

Definition 2 Given a set of parallel schemas $Q = \{Q_l = (S_{l1}, S_{l2}),\ l = 1 \ldots m\}$ and two attributes $A_p$ and $A_q$, a weighted cross co-presence count of $A_p$ and $A_q$ in $Q$ is defined as the sum of the cross co-presence weights that $A_p$ and $A_q$ gain in each parallel schema of $Q$, denoted as

$$\tilde{D}_{pq} = \sum_{(A_p, A_q) \mathrel{\dot{\in}} Q_l} \frac{1}{|S_{l1}|\,|S_{l2}|}.$$

Domain       Correct?   Discovered Matching
Airfares     Y          {departure date (datetime), return date (datetime)} = {depart (datetime), return (datetime)}
             Y          {adult (integer), children (integer), infant (integer), senior (integer)} = {passenger (integer)}
             P          {destination (string)} = {from (string), to (string)} = {arrival city (string), departure city (string)}
             Y          {cabin (string)} = {class (string)}
CarRentals   Y          {drop off city (string), pick up city (string)} = {drop off location (string), pick up location (string)}
             Y          {drop off (datetime), pick up (datetime)} = {pick up date (datetime), drop off date (datetime), pick up time (datetime), drop off time (datetime)}

Table 1: Discovered matchings for Airfares and CarRentals (T = 10%).
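Definitions 1 and 2 can be illustrated with a small sketch. It assumes that parallel schemas are given as pairs of disjoint Python sets; the function name and the toy data are hypothetical.

```python
from collections import defaultdict
from itertools import product


def weighted_cross_copresence(parallel):
    """For each unordered attribute pair, sum 1 / (|S_l1| * |S_l2|) over every
    parallel schema in which the pair is cross co-present."""
    d = defaultdict(float)
    for left, right in parallel:
        weight = 1.0 / (len(left) * len(right))
        # Every (left attribute, right attribute) combination is cross co-present.
        for p, q in product(left, right):
            d[frozenset((p, q))] += weight
    return dict(d)


# Hypothetical parallel schemas (common attributes already deleted).
parallel = [
    ({"author", "category"}, {"first name", "last name", "style"}),
    ({"category"}, {"style"}),
    ({"first name", "last name"}, {"author"}),
]
d_tilde = weighted_cross_copresence(parallel)
```

Within one parallel schema the weights of all cross co-present pairs sum to 1, which reflects the assumption that each parallel schema contributes the same amount of information.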

For any two attributes $A_p$ and $A_q$, a matching score $X_{pq}$ measures the possibility that $A_p$ and $A_q$ are synonym attributes. The bigger the score, the more likely it is that the two attributes are synonym attributes. The matching score calculation between $A_p$ and $A_q$ is similar to the Dice coefficient:

$$X_{pq} = \begin{cases} 0 & \text{if } (A_p, A_q) \notin L \\ \dfrac{2 \tilde{D}_{pq}}{C_p + C_q} & \text{otherwise,} \end{cases} \qquad (3)$$

where $L$ is the set of synonym attribute candidates and $C_p$ and $C_q$ are the counts of attributes $A_p$ and $A_q$ in $S$.

This matching score has the following properties:

1. null invariance [11]. For any two attributes, adding more schemas that do not contain the attributes does not affect their matching score.

2. rareness differentiation. The matching score between a rare attribute and the other attributes is distinguishably low.
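Eq. (3) can be sketched directly. The function signature, the use of `frozenset` keys for unordered pairs, and the numeric inputs below are assumptions made for illustration.

```python
def matching_score(p, q, d_tilde, count, candidates):
    """Eq. 3: 0 for non-candidate pairs, otherwise 2 * D~_pq / (C_p + C_q)."""
    pair = frozenset((p, q))
    if pair not in candidates:
        return 0.0
    return 2.0 * d_tilde.get(pair, 0.0) / (count[p] + count[q])


# Hypothetical inputs: counts C_p, weighted cross co-presence counts, candidate set L.
count = {"category": 1, "style": 2, "first name": 1, "last name": 1}
d_tilde = {frozenset(("category", "style")): 1.0 + 1.0 / 6.0}
candidates = {frozenset(("category", "style"))}

x_syn = matching_score("category", "style", d_tilde, count, candidates)
x_grp = matching_score("first name", "last name", d_tilde, count, candidates)
```

The number of schemas $u$ does not appear in the nonzero branch, and neither $\tilde{D}_{pq}$ nor the counts change when schemas containing neither attribute are added, which is consistent with the null-invariance property stated above.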

Grouping Score Calculation takes all schemas as input and calculates the grouping score between every two attributes based on observation 4. We use the following grouping score measure between two attributes $A_p$ and $A_q$:

$$Y_{pq} = \frac{C_{pq}}{\min(C_p, C_q)}, \qquad (4)$$

where $C_{pq}$ is the co-presence count of attributes $A_p$ and $A_q$ in $S$, and $C_p$ and $C_q$ are the counts of attributes $A_p$ and $A_q$ in $S$.

We also set a grouping score threshold $T_g$ such that attributes $A_p$ and $A_q$ will be considered grouping attributes only when $Y_{pq} > T_g$. Practically, $T_g$ should be close to 1, as the grouping attributes are expected to co-occur often. $T_g$ is an empirical parameter, and the experimental results show that performance is similar over a wide range of its values.
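The grouping score of Eq. (4) and the threshold test can be sketched as below. The function name and toy schemas are hypothetical; the threshold value 0.9 is the one used in the experiments.

```python
from collections import Counter
from itertools import combinations


def grouping_scores(schemas):
    """Eq. 4: Y_pq = C_pq / min(C_p, C_q) for every attribute pair that co-occurs at least once."""
    count = Counter(a for s in schemas for a in s)                  # C_p
    co_count = Counter(                                             # C_pq
        frozenset(pair) for s in schemas for pair in combinations(s, 2)
    )
    return {pair: c / min(count[a] for a in pair) for pair, c in co_count.items()}


# Hypothetical toy input.
schemas = [
    {"first name", "last name"},
    {"first name", "last name", "style"},
    {"author", "style"},
    {"author"},
]
T_G = 0.9  # grouping score threshold T_g
scores = grouping_scores(schemas)
grouping_pairs = {pair for pair, y in scores.items() if y > T_G}
```

Pairs that never co-occur are simply absent from `scores`, which corresponds to a grouping score of 0.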

Schema Matching Discovery uses an iterative algorithm where, in each iteration, a greedy selection strategy is used to choose the synonym attribute candidates with the highest matching score, until there is no synonym attribute candidate available. The greediness of this algorithm has the benefit of filtering bad matchings in favor of good ones. Another interesting and beneficial characteristic of this algorithm is that it is matching score centric, i.e., the matching score plays a much more important role than the grouping score.
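The greedy selection idea can be illustrated with a deliberately simplified sketch. It is not the PSM discovery algorithm itself: it handles simple (1:1) matchings only, ignores the grouping score, and assumes that scores do not change between iterations, so sorting once is equivalent to repeatedly picking the maximum. The function name and scores are hypothetical.

```python
def greedy_simple_matchings(scores):
    """Repeatedly take the highest-scoring candidate pair whose two attributes
    are both still unused, so that each attribute appears at most once."""
    matchings = []
    used = set()
    # Descending score; ties are broken by attribute names for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], sorted(kv[0])))
    for pair, score in ranked:
        if score <= 0 or used & pair:
            continue
        matchings.append(tuple(sorted(pair)))
        used |= pair
    return matchings


# Hypothetical matching scores X_pq for synonym attribute candidates.
scores = {
    frozenset(("category", "style")): 0.78,
    frozenset(("category", "genre")): 0.40,
    frozenset(("author", "writer")): 0.55,
}
result = greedy_simple_matchings(scores)
```

Here the lower-scoring pair `("category", "genre")` is filtered out because `category` has already been matched, which mirrors how greediness favors good matchings over bad ones.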

4 EXPERIMENTS

We use the TEL-8 and BAMM datasets from the UIUC Web integration repository [2]. The TEL-8 dataset contains query interface schemas extracted from 447 deep Web sources of eight representative domains, where each domain contains about 20-70 schemas. The BAMM dataset contains query interface schemas extracted from four domains, where each domain has about 50 schemas.

We evaluate the set of matchings automatically discovered by PSM, denoted by $M_p$, by comparing it with the set of matchings manually collected by a domain expert, denoted by $M_c$. To facilitate comparison, we adopt the metric in [5], target accuracy, which evaluates how similar $M_p$ is to $M_c$. The target accuracy metric includes target precision and target recall. The target precision and target recall of $M_p$ with respect to $M_c$ are the weighted averages of all the attributes' target precision and target recall.

Result on the TEL-8 dataset: Table 1 shows the matchings discovered by PSM in the Airfares and CarRentals domains when T is set to 10%. We see that PSM can identify very complex matchings among attributes. Table 2 presents the performance of PSM on TEL-8 when $T_g$ is set to 0.9. As expected, the performance of PSM decreases for rare attributes because the occurrence pattern of rare attributes is not obvious with only a few occurrences. Nevertheless, the performance of PSM is almost always better than the performance (footnote 2) of DCM in [5], shown in Table 3.

Result on the BAMM dataset: The performance of PSM on BAMM when $T_g$ is set to 0.9 is shown in Table 4. PSM performs well on BAMM too, except that the target precision in the Automobiles domain is low when T = 10%. The reason is similar to that for TEL-8.

Footnote 2: The precision of DCM when T = 5% is not available in [5].


Domain          T=20%         T=10%         T=5%
                P_T   R_T     P_T   R_T     P_T   R_T
Airfares        1     1       1     .94     .90   .86
Automobiles     1     1       1     1       .76   .88
Books           1     1       1     1       .67   1
CarRentals      1     1       .89   .91     .64   .78
Hotels          1     1       .72   1       .60   .88
Jobs            1     1       1     1       .70   .72
Movies          1     1       1     1       .72   1
MusicRecords    1     1       .74   1       .62   .88
Average         1     1       .92   .98     .70   .88

Table 2: Target accuracy of PSM on TEL-8 dataset (T_g = 0.9).

Domain          T=20%         T=10%
                P_T   R_T     P_T   R_T
Airfares        1     1       1     .71
Automobiles     1     1       .93   1
Books           1     1       1     1
CarRentals      .72   1       .72   .60
Hotels          .86   1       .86   .87
Jobs            1     .86     .78   .87
Movies          1     1       1     1
MusicRecords    1     1       .76   1
Average         .95   .98     .88   .88

Table 3: Target accuracy of DCM on TEL-8 dataset.

Domain          T=20%         T=10%         T=5%
                P_T   R_T     P_T   R_T     P_T   R_T
Automobiles     1     1       .55   1       .75   1
Books           1     1       .86   1       .82   1
Movies          1     1       1     1       .90   .86
MusicRecords    1     1       .81   1       .72   1
Average         1     1       .81   1       .80   .97

Table 4: Target accuracy of PSM on BAMM dataset (T_g = 0.9).

5 SUMMARY

We present a parallel schema matching approach to holistically discover attribute matchings across Web query interfaces. PSM is based purely on the occurrence patterns of attributes and requires neither domain knowledge nor user interaction. Experimental results show that PSM discovers both simple and complex matchings with very high accuracy, in time polynomial in the number of attributes and the number of schemas.

Acknowledgment: This research was supported by the Research Grants Council of Hong Kong under grant HKUST6172/04E.

References

[1] D. Beneventano, S. Bergamaschi, S. Castano, A. Corni, R. Guidetti, G. Malvezzi, M. Melchiori, and M. Vincini. Information integration: The MOMIS project demonstration. In 26th Int. Conf. Very Large Data Bases, pages 611–614, 2000.

[2] K. C.-C. Chang, B. He, C. Li, and Z. Zhang. The UIUC Web integration repository. Computer Science Department, University of Illinois at Urbana-Champaign. http://metaquerier.cs.uiuc.edu/repository, 2003.

[3] R. Dhamankar, Y. Lee, A. Doan, A. Halevy, and P. Domingos. iMAP: Discovering complex semantic matches between database schemas. In ACM SIGMOD Conference, pages 383–394, 2004.

[4] A. Doan, P. Domingos, and A. Y. Halevy. Reconciling schemas of disparate data sources: A machine-learning approach. In ACM SIGMOD Conference, pages 509–520, 2001.

[5] B. He and K. C.-C. Chang. Discovering complex matchings across Web query interfaces: A correlation mining approach. In ACM SIGKDD Conference, pages 147–158, 2004.

[6] B. He, K. C.-C. Chang, and J. Han. Statistical schema matching across Web query interfaces. In ACM SIGMOD Conference, pages 217–228, 2003.

[7] W. Li, C. Clifton, and S. Liu. Database Integration using Neural Network: Implementation and Experience. In Knowledge and Information Systems, 2(1), pages 73–96, 2000.

[8] S. Melnik, H. Garcia-Molina, and E. Rahm. Similarity flooding: A versatile graph matching algorithm. In 18th Int. Conf. on Data Engineering, pages 117–128, 2002.

[9] T. Milo and S. Zohar. Using schema matching to simplify heterogeneous data translation. In 24th Int. Conf. Very Large Data Bases, pages 122–133, 1998.

[10] E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. The VLDB Journal, 10:334–350, 2001.

[11] P. Tan, V. Kumar, and J. Srivastava. Selecting the right interestingness measure for association patterns. In ACM SIGKDD Conference, pages 32–41, 2002.

[12] W. Wu, C. Yu, A. Doan, and W. Meng. An interactive clustering-based approach to integrating source query interfaces on the deep Web. In ACM SIGMOD Conference, pages 95–106, 2004.
