

Page 1: ICDE 06         04.05.06


The National Centres of Competence in Research are managed by the Swiss National Science Foundation on behalf of the Federal Authorities

ICDE 06 04.05.06

Probabilistic Message Passing in Peer Data Management Systems

Philippe Cudré-Mauroux, EPFL

Joint work with:

Karl Aberer (EPFL)
Andras Feher (T.U. Darmstadt)

Page 2: ICDE 06         04.05.06


Overview of the talk

• Data Integration in Large-Scale Information Systems
– Peer Data Management Systems (PDMS)

• Query Routing in PDMS
– Precision / Recall tradeoff

• Probabilistic Message Passing
– Deriving quality measures for the mappings

• Conclusions

Page 3: ICDE 06         04.05.06


Classical Data Integration: LAV/GAV

• Traditional database techniques (e.g., LAV/GAV) rely on centralized schemas to integrate data sources

• Not applicable to large-scale, decentralized contexts
– Scale (upper ontologies?)
– Churn
– Autonomy

• How can we foster semantic interoperability in decentralized settings?

[Figure: a central attribute Date integrating the local attributes myDate and yourDate through the mappings m(myDate) = Date and m(yourDate) = Date]

Page 4: ICDE 06         04.05.06


Peer Data Management Systems (1)

Q1 = <GUID>$p/GUID</GUID>
     FOR $p IN /Photoshop_Image
     WHERE $p/Creator LIKE "%Robi%"

<Photoshop_Image>
  <GUID>178A8CD8865</GUID>
  <Creator>Robinson</Creator>
  <Subject>
    <Bag>
      <Item>Tunbridge Wells</Item>
      <Item>Royal Council</Item>
    </Bag>
  </Subject>
  …
</Photoshop_Image>

Photoshop (own schema)

<WinFSImage>
  <GUID>178A8CD8866</GUID>
  <Author>
    <DisplayName>Henry Peach Robinson</DisplayName>
    <Role>Photographer</Role>
  </Author>
  <Keyword>Tunbridge</Keyword>
  <Keyword>Council</Keyword>
  …
</WinFSImage>

WinFS (known schema)

T12 = <Photoshop_Image>
        <GUID>$fs/GUID</GUID>
        <Creator>$fs/Author/DisplayName</Creator>
      </Photoshop_Image>
      FOR $fs IN /WinFSImage

Q2 = <GUID>$p/GUID</GUID>
     FOR $p IN T12
     WHERE $p/Creator LIKE "%Robi%"

Extending data integration techniques to decentralized settings
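
The reformulation of Q1 into Q2 through the view T12 can be sketched in Python. This is only an illustration under stated assumptions: the function names `t12` and `query` are hypothetical, the two documents are trimmed to the elements the query touches, and a substring test stands in for LIKE "%Robi%".

```python
import xml.etree.ElementTree as ET

PHOTOSHOP_DOC = """
<Photoshop_Image>
  <GUID>178A8CD8865</GUID>
  <Creator>Robinson</Creator>
</Photoshop_Image>
"""

WINFS_DOC = """
<WinFSImage>
  <GUID>178A8CD8866</GUID>
  <Author>
    <DisplayName> Henry Peach Robinson </DisplayName>
    <Role>Photographer</Role>
  </Author>
</WinFSImage>
"""


def t12(winfs_image):
    """Hypothetical counterpart of T12: view a WinFSImage as a Photoshop_Image."""
    view = ET.Element("Photoshop_Image")
    ET.SubElement(view, "GUID").text = (winfs_image.findtext("GUID") or "").strip()
    ET.SubElement(view, "Creator").text = (
        winfs_image.findtext("Author/DisplayName") or ""
    ).strip()
    return view


def query(images, needle="Robi"):
    """Return the GUIDs of images whose Creator contains the needle."""
    return [
        (img.findtext("GUID") or "").strip()
        for img in images
        if needle in (img.findtext("Creator") or "")
    ]


local = [ET.fromstring(PHOTOSHOP_DOC)]        # Q1: evaluated on the own schema
remote = [t12(ET.fromstring(WINFS_DOC))]      # Q2: same query over the T12 view
print(query(local), query(remote))
```

The same `query` function serves both cases, which is the point of the slide: only the mapping differs between the local and the reformulated query.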

Page 5: ICDE 06         04.05.06


Peer Data Management Systems (2)

• Pairwise mappings
• Local mappings overcome global heterogeneity
– Iterative query reformulation

<xap:CreateDate>2001-12-19T18:49:03Z</xap:CreateDate>
<xap:ModifyDate>2001-12-19T20:09:28Z</xap:ModifyDate>

date?

<es:cDate> 05/08/2004 </es:cDate>

<myRDF:Date> Jan 1, 2005 </myRDF:Date>

[Figure: network of peers connected by pairwise mappings; labels include article, weather, es:cDate, xap:CreateDate, xap:ModifyDate and myRDF:Date]

Page 6: ICDE 06         04.05.06


PDMS Examples

• Some academic systems
– Piazza
– Hyperion
– BestPeer
– GridVine
– …

• Out there on the Internet
– The Sequence Retrieval System (SRS)
  • 388 schemata (May 05, EBI repository)
  • 518 mappings (ID <-> ID)
  • Power-law distribution of node degrees
  • Clustering coefficient = 0.32
  • Diameter = 9
– Semantic Overlay Networks
  • P2P + semi-structured data
– The Semantic Web

Page 7: ICDE 06         04.05.06


Data in large-scale PDMS

• Large-Scale PDMS
– Number of sources > 100
– Unreliable data
  • Autonomy
– Semi-structured data
  • E.g., XML/RDF
– No integrity constraints
– No transactions
– Simple SP queries
  • E.g., triple patterns, ranking
– Schemata created by end users
– Network churn

VS

• Distributed Databases
– Number of sources < 100
– Consistent data
  • Coordination
– Structured data
  • E.g., Relational data model
– Integrity constraints
– Transactions
– Powerful queries
  • E.g., SQL, aggregation
– Schemas created by administrators
– Relatively fixed topology

Page 8: ICDE 06         04.05.06


Problem: Precision/Recall Tradeoff (1)

• Semantic query routing
– To whom shall I forward a query posed against my local schema?

• Some (most) mappings will be (partially) faulty
– Low expressive power of mapping languages
  • samePropertyAs / sameClassAs / subclassOf
  • … or even worse (Microformats)
– Automatic schema alignment techniques
– Different views on conceptualizations

• Local query resolution
– Low recall

• Flooding (PDMS so far)
– Low precision

Page 9: ICDE 06         04.05.06


Problem: Precision/Recall Tradeoff (2)

• Standard deductive integration is not sufficient
– Uncertainty on mappings and conceptualizations

• Probabilistic Message Passing
– Deriving quality measures for the mappings
  • Reduces uncertainty
  • Used to route query / optimize mappings
– Based on a notion of agreement on conceptualizations
  • Decentralized decision making, Emergent Semantics

• From Schema Matching to Probabilistic Message Passing
– Automatic Schema Matching
  • INPUT: 2 schemas + data
  • OUTPUT: 1 mapping
– Probabilistic Message Passing
  • INPUT: n schemas and m mappings
  • OUTPUT: quality measures for the mappings

Page 10: ICDE 06         04.05.06


Probabilistic Message Passing

• Link-based analysis of the PDMS
– Automatically deriving quality measures for the mappings

• Transitive closures of mapping operations
– Mapping cycles
– Parallel paths

q VS m3(m4(m0(q)))

q: art/Creator?

art/Creator? VS art/creatDate?

[Figure: mapping cycle through m0, m4 and m3, with feedback f0]
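
The comparison of q with m3(m4(m0(q))) can be sketched as follows. This is a minimal illustration, assuming attribute-level mappings stored as dictionaries; the intermediate attribute names (`img/Author`, `doc/Maker`) and the function names are hypothetical and do not come from the talk.

```python
def apply_chain(attribute, mappings):
    """Push an attribute through a chain of mappings, in order.

    Returns None when some mapping has no image for the attribute.
    """
    for mapping in mappings:
        attribute = mapping.get(attribute)
        if attribute is None:
            return None
    return attribute


def cycle_feedback(attribute, cycle):
    """Compare an attribute with its image after a full mapping cycle."""
    result = apply_chain(attribute, cycle)
    if result is None:
        return None  # the cycle is broken for this attribute: no feedback
    return "positive" if result == attribute else "negative"


# Hypothetical attribute-level mappings along the cycle m0, m4, m3.
m0 = {"art/Creator": "img/Author"}
m4 = {"img/Author": "doc/Maker"}
m3_consistent = {"doc/Maker": "art/Creator"}
m3_faulty = {"doc/Maker": "art/creatDate"}

print(cycle_feedback("art/Creator", [m0, m4, m3_consistent]))
print(cycle_feedback("art/Creator", [m0, m4, m3_faulty]))
```

A negative feedback only says that at least one mapping on the cycle is suspicious; it does not say which one, which is why the later slides turn to a probabilistic model.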

Page 11: ICDE 06         04.05.06


On Cycles / parallel paths

[Figure: mapping graph with mappings m0 to m5 and feedback f0]

Page 12: ICDE 06         04.05.06


Computing a Marginal for one cycle

• $P(m_0, m_3, m_4, f_0) = P(m_0)\,P(m_3)\,P(m_4)\,P(f_0 \mid m_0, m_3, m_4)$

• $P(m_0 \mid f_0) = \sum_{m_3, m_4} P(m_0, m_3, m_4, f_0)\,P(f_0)^{-1}$

• But: feedbacks on different cycles are correlated
– One wrong mapping will affect several cycles/paths
– Need to express a global probabilistic model for the mapping graph

[Figure legend: observed, unknown]
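
The marginal for one cycle can be computed by brute-force enumeration. The sketch below rests on an assumption the slide does not spell out: the conditional P(f0 = positive | m0, m3, m4) is taken to be 1 when all mappings are correct, 0 when exactly one is incorrect, and delta when two or more are incorrect (errors may compensate each other). The function names are hypothetical; the values 0.7 and 0.1 mirror the prior and delta quoted with the results later in the talk.

```python
from itertools import product


def p_positive(states, delta):
    """Assumed P(positive feedback | states); True means the mapping is correct."""
    wrong = states.count(False)
    if wrong == 0:
        return 1.0
    if wrong == 1:
        return 0.0
    return delta  # two or more errors compensate with probability delta


def posterior_correct(priors, positive, delta=0.1):
    """P(first mapping correct | feedback), summing out the other mappings."""
    joint_correct = 0.0  # P(first mapping correct, observed feedback)
    evidence = 0.0       # P(observed feedback)
    for states in product((True, False), repeat=len(priors)):
        weight = 1.0
        for prior, state in zip(priors, states):
            weight *= prior if state else 1.0 - prior
        likelihood = p_positive(states, delta)
        if not positive:
            likelihood = 1.0 - likelihood
        evidence += weight * likelihood
        if states[0]:
            joint_correct += weight * likelihood
    return joint_correct / evidence


priors = [0.7, 0.7, 0.7]  # P(m0), P(m3), P(m4)
print(posterior_correct(priors, positive=True))
print(posterior_correct(priors, positive=False))
```

Under these assumptions a positive feedback raises the belief in m0 above its prior and a negative feedback lowers it.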

Page 13: ICDE 06         04.05.06


A Brief Intro to Factor-Graphs

• $g(x_1, x_2, x_3, x_4) = f_A(x_1, x_2)\,f_B(x_2, x_3, x_4)$
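
To illustrate why the factorization helps, the sketch below computes the unnormalized marginal of x2 twice: once by summing g over all assignments, and once by multiplying the two messages that the factors fA and fB send to x2 in the sum-product algorithm. The factor tables are arbitrary and purely illustrative, and the variables are assumed to be binary.

```python
from itertools import product


def f_a(x1, x2):
    """Arbitrary positive factor over (x1, x2), for illustration only."""
    return 1.0 + x1 + 2 * x2


def f_b(x2, x3, x4):
    """Arbitrary positive factor over (x2, x3, x4), for illustration only."""
    return 1.0 + x2 * x3 + x4


def marginal_x2_brute_force():
    """Sum g = fA * fB over every assignment of the other variables."""
    totals = [0.0, 0.0]
    for x1, x2, x3, x4 in product((0, 1), repeat=4):
        totals[x2] += f_a(x1, x2) * f_b(x2, x3, x4)
    return totals


def marginal_x2_sum_product():
    """Multiply the messages sent to x2 by its two neighbouring factors."""
    totals = []
    for x2 in (0, 1):
        msg_from_a = sum(f_a(x1, x2) for x1 in (0, 1))
        msg_from_b = sum(f_b(x2, x3, x4) for x3, x4 in product((0, 1), repeat=2))
        totals.append(msg_from_a * msg_from_b)
    return totals


print(marginal_x2_brute_force(), marginal_x2_sum_product())
```

Both functions return the same values because the sums distribute over the product; on a tree-shaped factor graph this local computation is exact.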

Page 14: ICDE 06         04.05.06


Deriving PDMS Factor-Graphs

Abductive reasoning on transitive closures of mappings

a priori information on mapping

Page 15: ICDE 06         04.05.06


PDMS Factor-Graphs

• Cyclic graph
– Junction Tree? Clustering / Stretching of variables?
  • Centralization
  • Computational + communication overhead
– Iterative Sum-Product
  • Approximate results

• How to perform iterative sum-product by message passing on the mapping graph?
– Message passing in factor graph does not correspond to connectivity of mapping graph
– We want to rely on decentralized computations only

• Locality VS Globality of nodes in the factor graph
– Mappings: local
– Feedback factor: common, global knowledge
– Observed feedback variables: neighborhood

Page 16: ICDE 06         04.05.06


Embedded Message-Passing (1)

Page 17: ICDE 06         04.05.06


Embedded Message-Passing (2)

Page 18: ICDE 06         04.05.06


Message Passing

• Decentralized computations
• Computationally inexpensive
– Sums and Products

• Message-Passing Schedules
– Periodic
– Lazy (piggybacking on query forwarding)
  • No message overhead
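
The sums and products involved can be illustrated with a small simulation of iterative sum-product on a cyclic factor graph. This is a sketch under several assumptions, not the embedded scheme of the talk: it runs centrally with a synchronous (periodic-style) schedule, the topology of the four feedback cycles is invented, and the feedback conditional is assumed to be 1 when all mappings on a cycle are correct, 0 when exactly one is incorrect, and delta otherwise.

```python
from itertools import product


def feedback_likelihood(states, positive, delta):
    """Assumed P(observed feedback | states); True means the mapping is correct."""
    wrong = sum(1 for s in states if not s)
    p_pos = 1.0 if wrong == 0 else (0.0 if wrong == 1 else delta)
    return p_pos if positive else 1.0 - p_pos


def normalize(pair):
    total = pair[0] + pair[1]
    return (pair[0] / total, pair[1] / total)


def iterative_sum_product(priors, feedbacks, delta=0.1, rounds=20):
    """Return approximate P(correct | feedbacks) for every mapping.

    priors: {mapping: P(correct)}; feedbacks: list of (cycle, positive).
    Messages are (weight incorrect, weight correct) pairs keyed by
    (mapping, index of the feedback factor).
    """
    var_to_fac = {(m, i): (1.0 - priors[m], priors[m])
                  for i, (cycle, _) in enumerate(feedbacks) for m in cycle}
    fac_to_var = {key: (0.5, 0.5) for key in var_to_fac}

    def combine(mapping, skip=None):
        """Prior times all incoming factor messages, except from factor `skip`."""
        weights = [1.0 - priors[mapping], priors[mapping]]
        for (m, i), msg in fac_to_var.items():
            if m == mapping and i != skip:
                weights[0] *= msg[0]
                weights[1] *= msg[1]
        return normalize(weights)

    for _ in range(rounds):
        # Factor-to-variable messages: sum out the other mappings of the cycle.
        for i, (cycle, positive) in enumerate(feedbacks):
            for target in cycle:
                others = [m for m in cycle if m != target]
                out = [0.0, 0.0]
                for t in (0, 1):
                    for combo in product((0, 1), repeat=len(others)):
                        weight = 1.0
                        for m, s in zip(others, combo):
                            weight *= var_to_fac[(m, i)][s]
                        states = [bool(t)] + [bool(s) for s in combo]
                        out[t] += weight * feedback_likelihood(states, positive, delta)
                fac_to_var[(target, i)] = normalize(out)
        # Variable-to-factor messages: product of the other incoming messages.
        for (m, i) in list(var_to_fac):
            var_to_fac[(m, i)] = combine(m, skip=i)
    return {m: combine(m)[1] for m in priors}


# Invented topology: the two cycles that contain m1 report negative feedback.
priors = {m: 0.7 for m in ("m0", "m1", "m2", "m3", "m4", "m5")}
feedbacks = [
    (("m0", "m3", "m4"), True),
    (("m0", "m1", "m2"), False),
    (("m1", "m4", "m5"), False),
    (("m2", "m3", "m5"), True),
]
beliefs = iterative_sum_product(priors, feedbacks)
for name in sorted(beliefs):
    print(name, round(beliefs[name], 3))
```

In this toy setting the mapping shared by the two negative cycles ends up with the lowest belief. Because the graph is cyclic, the beliefs are approximations rather than exact marginals.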

Page 19: ICDE 06         04.05.06


Implemented System

• Schemas
– Import from OWL (Web Ontology Language)

• Mappings
– KnowledgeWeb Ontology Alignment API
– Import from RDF/XML
– Automated on-the-fly creation
– Comparison to standard alignments

Automatic derivation of quality measures P(m=correct | {F}) for the mappings using iterative message-passing

Query routing based on the quality measures
Precision / recall tradeoff
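
One simple way to turn the quality measures into a routing decision is a threshold on P(m=correct | {F}). The sketch below is an assumption made for illustration: the function name, the threshold parameter and the sample values are hypothetical, and the implemented system may route differently.

```python
def route(quality, threshold):
    """Return the outgoing mappings whose P(correct) reaches the threshold."""
    return sorted(m for m, p in quality.items() if p >= threshold)


# Hypothetical quality measures for the outgoing mappings of one peer.
quality = {"m0": 0.93, "m1": 0.02, "m2": 0.60}

print(route(quality, 0.9))   # strict threshold: favours precision
print(route(quality, 0.5))   # looser threshold: favours recall
print(route(quality, 0.0))   # no filtering: equivalent to flooding
```

Raising the threshold forwards the query through fewer, more trustworthy mappings; lowering it reaches more peers at the cost of more false positives.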

Page 20: ICDE 06         04.05.06


Some (Preliminary) Results: Convergence

(undirected example graph, prior 0.7 delta 0.1)

Page 21: ICDE 06         04.05.06


Fault-tolerance (faulty links)

(undirected example graph, prior 0.8 delta 0.1)

Page 22: ICDE 06         04.05.06


Detecting Erroneous Mappings

(random network of 50 schemas and 200 mappings, no prior information)

Page 23: ICDE 06         04.05.06


Conclusions

• Deriving quality measures for PDMS mappings
– Automated process
– Decentralized computations
– Based on agreements on conceptualizations
  • Emergent Semantics

• Current work
– More expressive mappings
  • E.g., subsumption
– Integration in the GridVine semantic overlay network
– Application to other domains
  • Web Services composition?

Page 24: ICDE 06         04.05.06


Thank you for your attention

Web page: lsirpeople.epfl.ch/cudre

• Questions?
