MANAGING UNCERTAINTY OF XML SCHEMA MATCHING Reynold Cheng, Jian Gong, David W. Cheung ICDE’2010


MANAGING UNCERTAINTY OF XML SCHEMA MATCHING

Reynold Cheng, Jian Gong, David W. Cheung

ICDE’2010

THE DATA INTEGRATION PROBLEM
Querying the source data through a target query interface
E.g., querying multiple data sources through a mediated query interface

[Figure: a query interface over the target schema, linked by a schema mapping to the source schemas of the data sources]

SCHEMA MATCHING & MAPPING
Schema matching: finding element correspondences, with similarities, between schemas
Schema mapping: a set of one-to-one correspondences between two schemas
Generation: pick the best correspondences

Sample mapping: Order - ORDER, BP - IP, BCN - ICN, …

SCHEMA MAPPING AND UNCERTAINTY
The mapping between schemas can be uncertain: which one is correct?

Example: purchase order schemas
Uncertain mappings
  M1: Order-ORDER, …, BCN-ICN, …
  M2: Order-ORDER, …, RCN-ICN, …
  …

Compute Pr(Mi) by:
  1) aggregating the similarities of the correspondences, and
  2) normalizing the probabilities of the top-k mappings
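The two steps for computing Pr(Mi) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that "aggregating" means summing the similarities (the mapping score defined later in the talk), and the function name and the similarity values are hypothetical.

```python
def mapping_probabilities(mappings, k):
    """mappings: dict from mapping name to the similarities of its correspondences."""
    # Step 1: aggregate similarities into one score per mapping (assumed: a sum).
    scores = {name: sum(sims) for name, sims in mappings.items()}
    # Keep only the k best-scoring mappings.
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
    # Step 2: normalize the scores of the top-k mappings so they sum to 1.
    total = sum(score for _, score in top)
    return {name: score / total for name, score in top}


# Hypothetical similarity values for three candidate mappings.
example = {
    "M1": [0.9, 0.8, 0.7],
    "M2": [0.9, 0.8, 0.5],
    "M3": [0.9, 0.2, 0.1],
}
probs = mapping_probabilities(example, k=2)
```

With k = 2 only M1 and M2 survive, and their probabilities sum to 1.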

DATA INTEGRATION RELOADED
Managing uncertainty of XML schema matching
Issues: mapping generation and storage, query evaluation, etc.

[Figure: a query interface over the mediated schema, linked by an uncertain schema mapping to the source schemas of the data sources]

OBSERVATION
Sharing among uncertain mappings

Overlapping correspondences:
  “Order~ORDER” shared by m1-m5
  “BP~IP” shared by m1, m2, m4, m5
  “BCN~ICN” shared by m1, m2
  …

OBSERVATION
How much overlap is there in real-world schema mappings?
Overlap ratio (o-ratio): the average overlap of the top-100 possible schema mappings
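The slide does not give the formula for the o-ratio. As one possible reading, the sketch below averages the pairwise overlap of the mappings, taking the overlap of two mappings to be the number of shared correspondences divided by the size of their union. Both that choice and the function name are assumptions, and the correspondences are made up.

```python
from itertools import combinations


def o_ratio(mappings):
    """mappings: list of sets of correspondences; needs at least two mappings."""
    pairs = list(combinations(mappings, 2))
    if not pairs:
        raise ValueError("need at least two mappings")
    # Assumed overlap of a pair: shared correspondences over all correspondences.
    overlaps = [len(a & b) / len(a | b) for a, b in pairs]
    return sum(overlaps) / len(overlaps)


# Three hypothetical mappings over made-up correspondences.
m1 = {"Order~ORDER", "BP~IP", "BCN~ICN"}
m2 = {"Order~ORDER", "BP~IP", "RCN~ICN"}
m3 = {"Order~ORDER", "SP~IP", "OCN~ICN"}
ratio = o_ratio([m1, m2, m3])
```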

OUR CONTRIBUTION
Propose the block tree: a novel data structure to represent a set of mappings
  Definition
  Efficient generation
Propose the probabilistic twig query (PTQ)
  Definition
  Efficient evaluation with the block tree
  Top-k PTQ, and its computation issues
Improve the possible mapping generation process
  A divide-and-conquer approach
Conduct experiments on real data to validate our methods

RELATED WORK
Schema matching approaches and tools [RB01]
  COMA [DR02]
Managing uncertainty in schema matching
  Top-k schema mappings [Gal06]
  Generating top-k mappings [Murty86]
Query evaluation in data integration
  Theoretical foundation [Len02]
  Data integration with uncertainty [DHY07]
  XML query rewriting for data integration [YP04]
XML query evaluation
  Twig query [QYD07]
  Querying probabilistic XML documents [KYS08]

OUTLINE
Introduction
Problem
  Data model
  Query model
Techniques
Results
Conclusion

DATA MODEL
XML schema and document [QYD07]
  Node-labeled tree
  Document nodes may carry text values
Schema mapping [DHY07]
  One-to-one mapping

[Figure: two schemas and a document]

Uncertain mappings
  M1: Order-ORDER, …, BCN-ICN, …
  M2: Order-ORDER, …, RCN-ICN, …
  …

QUERY MODEL (SINGLE MAPPING)
Twig query through a target schema [YP04]
  Step 1: rewrite the target query into a source query, based on the schema mapping
    M1: Order-ORDER, BP-IP, BCN-ICN, …

[Figure: the target query over the target schema, and the rewritten source query over the source schema]

QUERY MODEL (SINGLE MAPPING)
Twig query through a target schema [YP04]
  Step 1: rewrite the target query into a source query, based on the schema mapping
  Step 2: evaluate the source query on the source document

[Figure: the source query and the source document]

QUERY MODEL (UNCERTAIN MAPPINGS)
Query evaluation with uncertain mappings [DHY07]
  Mappings: pM = {(M1,Pr(M1)), …, (Mh,Pr(Mh))}
  The query answers from mapping Mi have probability Pr(Mi)

[Figure: the target query QT is rewritten under each mapping (Mi,Pr(Mi)) into a source query QSi, which is then evaluated to give the answers (Ri,Pr(Mi)), for i = 1, …, h]
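The rewrite-then-evaluate pipeline over uncertain mappings can be sketched as below. It is deliberately simplified: the query is a single target element name (like //ICN), the source document is a flat list of (label, text) pairs, and each mapping is a dictionary from target element to source element. All names, probabilities and document values are hypothetical.

```python
def evaluate_with_uncertain_mappings(target_label, p_mappings, source_doc):
    """Return one (answers, probability) pair per possible mapping."""
    results = []
    for mapping, prob in p_mappings:
        # Rewriting: replace the target element by its source element.
        source_label = mapping.get(target_label)
        # Evaluation: collect the text of every matching source node.
        answers = [text for label, text in source_doc if label == source_label]
        # Answers obtained through Mi carry the probability Pr(Mi).
        results.append((answers, prob))
    return results


# Hypothetical source document and mapping probabilities.
source_doc = [("BCN", "Alice"), ("RCN", "Bob")]
p_mappings = [
    ({"ORDER": "Order", "ICN": "BCN"}, 0.6),  # M1
    ({"ORDER": "Order", "ICN": "RCN"}, 0.4),  # M2
]
results = evaluate_with_uncertain_mappings("ICN", p_mappings, source_doc)
```

Here M1 returns "Alice" with probability 0.6 and M2 returns "Bob" with probability 0.4.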

OUTLINE
Introduction
Problem
Techniques
  Block tree
  Query evaluation
  Mapping generation
Results
Conclusion

THE BLOCK
Each block, which is attached to a target schema element, consists of:
  C: a set of correspondences
  M: a set of mappings
Semantics: the mappings in M share the correspondences in C
Drawback: an exponential number of blocks to handle

THE C-BLOCK
A c-block (constrained block) is a block which:
  Contains correspondences for all elements in its sub-tree (so that it is more useful for query evaluation)
  Has its correspondences shared by more mappings than a threshold (otherwise it is not worth storing)

Example: |pM| = 5, threshold = 0.4
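The two c-block conditions can be checked as in the sketch below. The class and function names are hypothetical, and so is the toy schema in which ICN sits below IP. The slide words the second condition as "more than a threshold"; the sketch assumes an inclusive comparison against threshold × |pM|, so that with |pM| = 5 and threshold 0.4 a block shared by two mappings qualifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    element: str                 # target schema element the block is attached to
    correspondences: frozenset   # C: set of (source element, target element)
    mappings: frozenset          # M: ids of the mappings sharing C


def is_c_block(block, subtree_elements, total_mappings, threshold):
    # Condition 1: C has a correspondence for every element of the sub-tree.
    covered = {target for _, target in block.correspondences}
    covers_subtree = set(subtree_elements) <= covered
    # Condition 2 (assumed inclusive): enough mappings share the block.
    shared_enough = len(block.mappings) >= threshold * total_mappings
    return covers_subtree and shared_enough


block = Block(
    element="IP",
    correspondences=frozenset({("BP", "IP"), ("BCN", "ICN")}),
    mappings=frozenset({"m1", "m2"}),
)
ok = is_c_block(block, ["IP", "ICN"], total_mappings=5, threshold=0.4)
```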

THE BLOCK TREE
Creation of the block tree
  Follows the structure of the target schema
  A bottom-up method

Lemma 1 (informal): The c-blocks for an element can be created from the c-blocks of its children.
Lemma 2 (informal): If an element has no c-block, then its parent (if any) has no c-block.
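The slides leave out the detail of Lemma 1, so the following bottom-up construction is only one plausible reading: a c-block of an element is formed by picking one c-block from each child, adding a correspondence for the element itself, and intersecting the mapping sets. The data layout, the function name and the minimum-count parameter are assumptions made for illustration.

```python
from itertools import product


def build_block_tree(schema, root, mappings, min_count):
    """schema: element -> children; mappings: id -> {target element: source element}."""
    tree = {}  # element -> list of (correspondences C, mapping ids M)

    def visit(element):
        # Group the mappings by the source element they assign to this element.
        groups = {}
        for mapping_id, mapping in mappings.items():
            source = mapping.get(element)
            if source is not None:
                groups.setdefault(source, set()).add(mapping_id)
        own = [({(source, element)}, ids) for source, ids in groups.items()]
        child_blocks = [visit(child) for child in schema.get(element, [])]
        blocks = []
        # Lemma 2: if some child has no c-block, this element has none either.
        if all(child_blocks):
            # Lemma 1 (assumed detail): combine one c-block per child.
            for combo in product(own, *child_blocks):
                corrs = set().union(*(c for c, _ in combo))
                ids = set.intersection(*(i for _, i in combo))
                if len(ids) >= min_count:
                    blocks.append((corrs, ids))
        tree[element] = blocks
        return blocks

    visit(root)
    return tree


# Hypothetical toy schema and mappings.
schema = {"ORDER": ["IP"], "IP": ["ICN"]}
mappings = {
    "m1": {"ORDER": "Order", "IP": "BP", "ICN": "BCN"},
    "m2": {"ORDER": "Order", "IP": "BP", "ICN": "BCN"},
    "m3": {"ORDER": "Order", "IP": "SP", "ICN": "RCN"},
    "m4": {"ORDER": "Order", "IP": "BP", "ICN": "RCN"},
    "m5": {"ORDER": "Order", "IP": "BP", "ICN": "OCN"},
}
tree = build_block_tree(schema, "ORDER", mappings, min_count=2)
```

In this toy input ICN gets two c-blocks (BCN~ICN and RCN~ICN), while IP and ORDER each get a single c-block shared by m1 and m2.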

THE BLOCK TREE
Reducing the storage cost of uncertain mappings
If part of a mapping is in the block tree, then replace it with a link

[Figure: a block tree over the target schema elements ORDER, IP, ICN, SP, SCN, ..., with blocks attached to the elements. Blocks shown (C = shared correspondences, M = mappings sharing them):
  C: Order~ORDER; M: m1, m2, m3, m4, m5
  C: BP~IP; M: m1, m2, m4, m5
  C: BP~IP, BCN~ICN; M: m1, m2
  C: BCN~ICN; M: m1, m2
  C: RCN~ICN; M: m3, m4
  C: OCN~SCN; M: m2, m3
  C: BCN~SCN; M: m4, m5
Each mapping m1-m5 is stored as its remaining correspondences plus links (such as b2.C, b4.C, b5.C) to the blocks that cover the rest.]
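The link-replacement idea can be sketched as follows. The greedy largest-block-first order is an assumption, not something the slide states, and the block ids, function name and sample mapping are hypothetical.

```python
def compress_mapping(mapping_id, correspondences, blocks):
    """blocks: block id -> (shared correspondences C, mapping ids M)."""
    remaining = set(correspondences)
    links = []
    # Assumed heuristic: try the blocks with the most correspondences first.
    largest_first = sorted(blocks.items(), key=lambda item: -len(item[1][0]))
    for block_id, (shared, members) in largest_first:
        # A block is usable if the mapping belongs to it and C is still unstored.
        if mapping_id in members and shared <= remaining:
            links.append(block_id)
            remaining -= shared
    return links, remaining


all_ids = {"m1", "m2", "m3", "m4", "m5"}
blocks = {
    "b1": ({"BCN~ICN"}, {"m1", "m2"}),
    "b5": ({"BP~IP", "BCN~ICN"}, {"m1", "m2"}),
    "g3": ({"Order~ORDER"}, all_ids),
}
m1 = {"Order~ORDER", "BP~IP", "BCN~ICN", "OCN~SCN"}
links, remaining = compress_mapping("m1", m1, blocks)
```

The mapping shrinks from four stored correspondences to one correspondence plus two links.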

OUTLINE
Introduction
Problem
Techniques
  Block tree
  Query evaluation
  Mapping generation
Results
Conclusion

QUERY EVALUATION AND UNCERTAINTY
The uncertainty in mappings may affect query answers

Uncertain mappings
  M1: Order-ORDER, …, BCN-ICN, …
  M2: Order-ORDER, …, RCN-ICN, …
  …

Target query Q: //ICN
  which finds all ICNs (contact names of invoice parties) in the purchase order

[Figure: an example source document, marking the nodes returned by M1 and the nodes returned by M2]

THE BASELINE APPROACH
Evaluate QT with each mapping in pM separately
Drawback
  When the mappings Mi are large, or h is large, the computation is expensive

[Figure: QT is rewritten under each mapping (Mi,Pr(Mi)) into a source query QSi, which is evaluated on the source document DS to give the answers (Ri,Pr(Mi)), for i = 1, …, h]

QUERY EVALUATION WITH BLOCK TREE
Consider the root of a query
  Case 1): the root is found in the block tree; then use the blocks to evaluate the whole query

QUERY EVALUATION WITH BLOCK TREE
Case 1): the root is found in the block tree; then use the blocks to evaluate the whole query
  Only one mapping in the block is used
  Deal with the remainder mappings

[Figure: a query with nodes IP and ICN]
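Case 1 can be sketched as below: every mapping in a block rewrites the query the same way, so the query is evaluated once per block with a representative mapping, and only the mappings outside all blocks are evaluated one by one. The function signature, the callback and the toy data are assumptions, and the sketch presumes each block's correspondences cover every element of the query.

```python
def evaluate_with_blocks(query, blocks, p_mappings, evaluate_one):
    """blocks: list of (C, M) at the query root; p_mappings: id -> (mapping, Pr)."""
    results = {}
    for _, members in blocks:
        pending = [m for m in sorted(members) if m not in results]
        if not pending:
            continue
        # Only one mapping of the block is used; its answers are shared.
        answers = evaluate_one(query, p_mappings[pending[0]][0])
        for mapping_id in pending:
            results[mapping_id] = answers
    # Remainder mappings: those not covered by any block.
    for mapping_id, (mapping, _) in p_mappings.items():
        if mapping_id not in results:
            results[mapping_id] = evaluate_one(query, mapping)
    return [(results[mid], prob) for mid, (_, prob) in p_mappings.items()]


calls = []


def evaluate_one(query, mapping):
    # Stand-in for rewriting plus evaluation; it only records that it ran.
    calls.append(query)
    return mapping[query]


p_mappings = {
    "m1": ({"ICN": "BCN"}, 0.3),
    "m2": ({"ICN": "BCN"}, 0.25),
    "m3": ({"ICN": "RCN"}, 0.2),
    "m4": ({"ICN": "RCN"}, 0.15),
    "m5": ({"ICN": "OCN"}, 0.1),
}
blocks = [({("BCN", "ICN")}, {"m1", "m2"}), ({("RCN", "ICN")}, {"m3", "m4"})]
answers = evaluate_with_blocks("ICN", blocks, p_mappings, evaluate_one)
```

Five mappings are answered with three evaluations instead of five.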

QUERY EVALUATION WITH BLOCK TREE
Consider the root of a query
  Case 1): the root is found in the block tree; then use the blocks to evaluate the whole query
  Case 2): the root is not found; decompose the query (if possible), invoke recursion, and join the partial answers

QUERY EVALUATION WITH BLOCK TREE
Case 2): the root is not found; decompose the query (if possible), invoke recursion, and join the partial answers

[Figure: a query with nodes ORDER, IP, ICN and SP is decomposed into ORDER + IP/ICN + SP; the parts are answered by direct query or by recursion, and the partial answers are joined]

OUTLINE
Introduction
Problem
  Data model
  Query model
Techniques
  Block tree
  Query evaluation
  Mapping generation
Results
Conclusion

MAPPING GENERATION
A mapping m for a schema S with another schema T contains a set of correspondences (es,et)
  et may be EMPTY, i.e., es matches no element in T
  Each element in S occurs exactly once in m
  Each element in T occurs at most once in m
  m's score is the sum of the similarities of its correspondences

Problem definition
  Given: two schemas S and T, and a set of correspondences (es,et) with similarities (which are the schema matching results)
  Return: h mappings m1, …, mh, whose scores are the highest
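The constraints and the score in this definition can be written down directly. In the sketch below EMPTY is represented by None and is assumed to contribute 0 to the score; the function names and the similarity values are hypothetical.

```python
def is_valid_mapping(mapping, source_elements):
    """mapping: list of (es, et) pairs, where et is None for EMPTY."""
    sources = [es for es, _ in mapping]
    targets = [et for _, et in mapping if et is not None]
    # Each element in S occurs exactly once in m.
    each_source_once = sorted(sources) == sorted(source_elements)
    # Each element in T occurs at most once in m.
    targets_unique = len(targets) == len(set(targets))
    return each_source_once and targets_unique


def score(mapping, similarity):
    # Sum of the similarities; EMPTY correspondences are assumed to add 0.
    return sum(similarity.get((es, et), 0.0) for es, et in mapping if et is not None)


# Hypothetical matching results.
similarity = {
    ("Order", "ORDER"): 0.9,
    ("BP", "IP"): 0.8,
    ("BCN", "ICN"): 0.7,
    ("BP", "ICN"): 0.3,
}
S = ["Order", "BP", "BCN"]
good = [("Order", "ORDER"), ("BP", "IP"), ("BCN", "ICN")]
bad = [("Order", "ORDER"), ("BP", "ICN"), ("BCN", "ICN")]  # ICN is used twice
sparse = [("Order", "ORDER"), ("BP", None), ("BCN", "ICN")]
```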

MAPPING GENERATION
Baseline solution
  Finding the h maximum bipartite matchings (min-cost flow)
  Polynomial in the size of the bipartite graph

MAPPING GENERATION
Observation: XML schema matching is usually sparse
Improvement: a divide-and-conquer approach
  Derive the partitions (maximal connected sub-graphs) of the bipartite graph
  Find the top-h partial mappings from each partition
  Merge
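The three steps can be sketched as below. To keep the example short, the top-h partial mappings of each partition are found by brute-force enumeration rather than by the ranked bipartite matching used in the talk, so this only illustrates the partition-and-merge structure. All function names and similarity values are hypothetical.

```python
import heapq
from itertools import product


def partitions(similarity):
    """Split the candidate correspondences into connected components."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for es, et in similarity:
        parent[find(("S", es))] = find(("T", et))
    groups = {}
    for (es, et), sim in similarity.items():
        groups.setdefault(find(("S", es)), {})[(es, et)] = sim
    return list(groups.values())


def top_h_partial(component, h):
    """Brute force (a stand-in): enumerate every valid partial mapping."""
    sources = sorted({es for es, _ in component})
    options = [
        [(s, et) for (s, et) in component if s == es] + [(es, None)]
        for es in sources
    ]
    found = []
    for choice in product(*options):
        targets = [et for _, et in choice if et is not None]
        if len(targets) == len(set(targets)):  # each target used at most once
            total = sum(component[c] for c in choice if c[1] is not None)
            found.append((total, choice))
    return heapq.nlargest(h, found, key=lambda item: item[0])


def merge(partials_a, partials_b, h):
    combined = [
        (sa + sb, ma + mb) for (sa, ma), (sb, mb) in product(partials_a, partials_b)
    ]
    return heapq.nlargest(h, combined, key=lambda item: item[0])


def top_h_mappings(similarity, h):
    result = [(0.0, ())]
    for component in partitions(similarity):
        result = merge(result, top_h_partial(component, h), h)
    return result


similarity = {
    ("Order", "ORDER"): 0.9,
    ("BP", "IP"): 0.8,
    ("SP", "IP"): 0.6,
    ("BCN", "ICN"): 0.7,
    ("RCN", "ICN"): 0.5,
}
best = top_h_mappings(similarity, h=2)
```

Merging keeps only h candidates after each step, which is safe because partitions share no elements and a full mapping's score is the sum of its partial scores.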

OUTLINE
Introduction
Problem
Techniques
Results
Conclusion

DATASET AND RESULTS
XML schemas and documents
  7 schemas for purchase orders, obtained from various e-commerce standards (e.g., XCBL, OpenTrans)
  Accompanying sample XML documents
Schema matching
  Tool: COMA++, with different schema matching methods
  10 datasets: (source-schema, target-schema, matching-method)
Target query
  10 hand-written queries

RESULTS
Uncertain mappings: do they really overlap?

RESULTS
How much space does the block tree save for storing uncertain mappings? And why?

RESULTS
Is the block tree effective?
  Intuitively, larger blocks tend to be more useful

RESULTS
The block tree can be efficiently created
  Fast, and controllable

RESULTS
Can the block tree really help to improve query performance?
  Varying the total number of mappings

RESULTS
Can it scale?
  Probabilistic twig query and top-k query

RESULTS
Top-h mapping generation
  Performance gain of partitioning

CONCLUSION
We study the problem of handling uncertainty in XML schema matching
Observation
  Overlapping mappings, sparse bipartite graphs, etc.
Approach
  The block tree
  Query evaluation with the block tree
  Generating uncertain mappings more efficiently
Future work
  Other types of queries, probabilistic documents, index update, the relational scenario, etc.


THANKS!

Q & A


REFERENCES
[Len02] Lenzerini, “Data integration: a theoretical perspective”, in PODS, 2002
[YP04] Yu et al, “Constraint-based XML query rewriting for data integration”, in SIGMOD, 2004
[DR02] Do et al, “COMA: a system for flexible combination of schema matching approaches”, in VLDB, 2002
[Gal06] Gal, “Managing uncertainty in schema matching with top-k schema mappings”, in J. Data Semantics VI, 2006
[DHY07] Dong et al, “Data integration with uncertainty”, in VLDB, 2007
[QYD07] Qin et al, “TwigList: make twig pattern matching fast”, in DASFAA, 2007
[Murty86] Murty, “An algorithm for ranking all the assignments in order of increasing cost”, Operations Research, vol 16, 1986
[RB01] Rahm et al, “A survey of approaches to automatic schema matching”, VLDB J, vol 10, 2001
[KYS08] Kimelfeld et al, “Query efficiency in probabilistic XML models”, in SIGMOD, 2008
…

QUERY REWRITING
Given
  A target twig query QT
  A schema mapping m between S and T, which is a set of correspondences (es,et)
Mapping semantics
  For each sub-tree in the source document DS which contains a set of source elements in m, there exists a sub-tree in the target document DT which contains the corresponding target elements
Procedure
  For each element in QT, replace it with a source element
  Connect all the source elements
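The procedure can be sketched for a twig query stored as nested (label, children) tuples. How the source elements are "connected" is not spelled out on the slide; the sketch assumes the shape of the target twig is kept and every edge becomes a descendant step. The function names and the dictionary form of the mapping (target element to source element) are also assumptions.

```python
def rewrite(query, mapping):
    """Replace every target element of the twig by its source element."""
    label, children = query
    if label not in mapping:
        return None  # no source element: no rewriting under this mapping
    rewritten = [rewrite(child, mapping) for child in children]
    if any(child is None for child in rewritten):
        return None
    return (mapping[label], rewritten)


def to_path(query):
    """Render a twig as an XPath-like string, using descendant steps."""
    label, children = query
    return label + "".join("[.//" + to_path(child) + "]" for child in children)


# Target twig ORDER -> IP -> ICN, rewritten with M1 from the earlier slides.
target_query = ("ORDER", [("IP", [("ICN", [])])])
m1 = {"ORDER": "Order", "IP": "BP", "ICN": "BCN"}
source_query = rewrite(target_query, m1)
source_path = "//" + to_path(source_query)
```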

LEMMA 1
Lemma 1 (conceptually): The c-blocks for a schema element t can be created from the c-blocks of t's children.

An example
[Figure: a target schema tree rooted at Order, with children InvoiceTo and DeliverTo; each has descendants Contact, name, email, Address, street, city and country. Elements are annotated with the sizes of the mapping sets of their blocks: 51|49, 49|51, 100, 52|48, 53|47, 49|51, 100, 52|48, 50|50, 51|49 on the lower elements, and 27|24|25|24 on each of InvoiceTo and DeliverTo. Child blocks: b1.M: 1-52, b2.M: 53-100, b3.M: 1,3,5,…, b4.M: 2,4,6,...]

RESULTS
What kinds of queries did we use?