1
The National Centres of Competence in Research are managed by the Swiss National Science Foundation on behalf of the Federal Authorities
ICDE 06 04.05.06
Probabilistic Message Passing in Peer Data Management Systems
Philippe Cudré-Mauroux, EPFL
Joint work with:
Karl Aberer (EPFL)
Andras Feher (T.U. Darmstadt)
2
Overview of the talk
• Data Integration in Large-Scale Information Systems
  – Peer Data Management Systems (PDMS)
• Query Routing in PDMS
  – Precision / Recall tradeoff
• Probabilistic Message Passing
  – Deriving quality measures for the mappings
• Conclusions
3
Classical Data Integration: LAV/GAV
• Traditional database techniques (e.g., LAV/GAV) rely on centralized schemas to integrate data sources
• Not applicable to large-scale, decentralized contexts
  – Scale (upper ontologies?)
  – Churn
  – Autonomy
• How can we foster semantic interoperability in decentralized settings?
[Figure: a central schema attribute Date, with local attributes myDate and yourDate mapped to it: m(myDate) = Date, m(yourDate) = Date]
4
Peer Data Management Systems (1)
Q1 =
  <GUID>$p/GUID</GUID>
  FOR $p IN /Photoshop_Image
  WHERE $p/Creator LIKE "%Robi%"

Photoshop (own schema):
  <Photoshop_Image>
    <GUID>178A8CD8865</GUID>
    <Creator>Robinson</Creator>
    <Subject>
      <Bag>
        <Item>Tunbridge Wells</Item>
        <Item>Royal Council</Item>
      </Bag>
    </Subject>
    …
  </Photoshop_Image>

WinFS (known schema):
  <WinFSImage>
    <GUID>178A8CD8866</GUID>
    <Author>
      <DisplayName>Henry Peach Robinson</DisplayName>
      <Role>Photographer</Role>
    </Author>
    <Keyword>Tunbridge</Keyword>
    <Keyword>Council</Keyword>
    …
  </WinFSImage>

T12 =
  <Photoshop_Image>
    <GUID>$fs/GUID</GUID>
    <Creator>$fs/Author/DisplayName</Creator>
  </Photoshop_Image>
  FOR $fs IN /WinFSImage

Q2 =
  <GUID>$p/GUID</GUID>
  FOR $p IN T12
  WHERE $p/Creator LIKE "%Robi%"

Extending data integration techniques to decentralized settings
5
Peer Data Management Systems (2)
• Pairwise mappings
• Local mappings overcome global heterogeneity
  – Iterative query reformulation
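[Figure: a query "date?" is reformulated iteratively across peers (labels include "article" and "weather") that store dates under different attributes:
  <xap:CreateDate>2001-12-19T18:49:03Z</xap:CreateDate>
  <xap:ModifyDate>2001-12-19T20:09:28Z</xap:ModifyDate>
  <es:cDate> 05/08/2004 </es:cDate>
  <myRDF:Date> Jan 1, 2005 </myRDF:Date>
Mapping edges are labeled es:cDate, xap:CreateDate, xap:ModifyDate and myRDF:Date.]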
6
PDMS Examples
• Some academic systems
  – Piazza
  – Hyperion
  – BestPeer
  – GridVine
  – …
• Out there on the Internet
  – The Sequence Retrieval System (SRS)
    • 388 schemata (May 05, EBI repository)
    • 518 mappings (ID <-> ID)
    • Power-law distribution of node degrees
    • Clustering coefficient = 0.32
    • Diameter = 9
  – Semantic Overlay Networks
    • P2P + semi-structured data
  – The Semantic Web
7
Data in large-scale PDMS
• Large-Scale PDMS
  – Number of sources > 100
  – Unreliable data
    • Autonomy
  – Semi-structured data
    • E.g., XML/RDF
  – No integrity constraints
  – No transactions
  – Simple SP queries
    • E.g., triple patterns, ranking
  – Schemata created by end users
  – Network churn
VS
• Distributed Databases
  – Number of sources < 100
  – Consistent data
    • Coordination
  – Structured data
    • E.g., relational data model
  – Integrity constraints
  – Transactions
  – Powerful queries
    • E.g., SQL, aggregation
  – Schemas created by administrators
  – Relatively fixed topology
8
Problem: Precision/Recall Tradeoff (1)
• Semantic query routing
  – To whom shall I forward a query posed against my local schema?
• Some (most) mappings will be (partially) faulty
  – Low expressive power of mapping languages
    • samePropertyAs / sameClassAs / subclassOf
    • … or even worse (Microformats)
  – Automatic schema alignment techniques
  – Different views on conceptualizations
• Local query resolution
  – Low recall
• Flooding (PDMS so far)
  – Low precision
9
Problem: Precision/Recall Tradeoff (2)
• Standard deductive integration is not sufficient
  – Uncertainty on mappings and conceptualizations
• Probabilistic Message Passing
  – Deriving quality measures for the mappings
    • Reduces uncertainty
    • Used to route query / optimize mappings
  – Based on a notion of agreement on conceptualizations
    • Decentralized decision making, Emergent Semantics
• From Schema Matching to Probabilistic Message Passing
  – Automatic Schema Matching
    • INPUT: 2 schemas + data
    • OUTPUT: 1 mapping
  – Probabilistic Message Passing
    • INPUT: n schemas and m mappings
    • OUTPUT: quality measures for the mappings
10
Probabilistic Message Passing
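• Link-based analysis of the PDMS
  – Automatically deriving quality measures for the mappings
• Transitive closures of mapping operations
  – Mapping cycles
  – Parallel paths
[Figure: a mapping cycle through m0, m4 and m3 yields a feedback f0. The query q: art/Creator? is compared with its image after the cycle, q VS m3(m4(m0(q))), e.g. art/Creator? VS art/creatDate?]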
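The comparison q VS m3(m4(m0(q))) can be sketched in a few lines of Python. This is only an illustration: it assumes each mapping is a plain attribute-to-attribute dictionary, and every name except art/Creator and art/creatDate (the function names and the intermediate attributes img/Author and pic/Maker) is hypothetical.

```python
def follow_cycle(attribute, mappings):
    """Apply the mappings in order and return the image of `attribute`,
    or None if some mapping has no entry for the current attribute."""
    current = attribute
    for mapping in mappings:
        if current not in mapping:
            return None  # the cycle cannot be completed for this attribute
        current = mapping[current]
    return current


def cycle_feedback(attribute, mappings):
    """True = positive feedback (attribute maps back to itself),
    False = negative feedback, None = no feedback available."""
    image = follow_cycle(attribute, mappings)
    if image is None:
        return None
    return image == attribute


# Hypothetical attribute-level mappings along the cycle m0 -> m4 -> m3.
m0 = {"art/Creator": "img/Author"}
m4 = {"img/Author": "pic/Maker"}
m3_correct = {"pic/Maker": "art/Creator"}
m3_faulty = {"pic/Maker": "art/creatDate"}  # semantically wrong mapping

positive = cycle_feedback("art/Creator", [m0, m4, m3_correct])
negative = cycle_feedback("art/Creator", [m0, m4, m3_faulty])
```

A negative feedback only says that at least one mapping on the cycle is wrong, not which one; that is why the feedback is treated probabilistically in the rest of the talk.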
11
On Cycles / parallel paths
[Figure: mapping graph with mappings m0 to m5 and feedback f0, illustrating cycles and parallel paths]
12
Computing a Marginal for one cycle
• $P(m_0, m_3, m_4, f_0) = P(m_0)\,P(m_3)\,P(m_4)\,P(f_0 \mid m_0, m_3, m_4)$
• $P(m_0 \mid f_0) = \frac{1}{P(f_0)} \sum_{m_3, m_4} P(m_0, m_3, m_4, f_0)$
• But: feedbacks on different cycles are correlated
  – One wrong mapping will affect several cycles/paths
  – Need to express a global probabilistic model for the mapping graph
[Figure legend: observed / unknown variables]
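The marginal for a single cycle can be computed by brute-force enumeration, as in the sketch below. The slides do not spell out the conditional P(f0 | m0, m3, m4), so the table used here is an assumption: a positive feedback is certain if all mappings are correct, impossible if exactly one is wrong, and has probability delta if two or more are wrong (errors may compensate). The function names are hypothetical.

```python
from itertools import product


def feedback_likelihood(states, positive, delta):
    """P(feedback | mapping states) under the ASSUMED table described above.
    `states` holds one bool per mapping (True = correct)."""
    wrong = sum(1 for ok in states if not ok)
    if wrong == 0:
        p_positive = 1.0
    elif wrong == 1:
        p_positive = 0.0
    else:
        p_positive = delta
    return p_positive if positive else 1.0 - p_positive


def cycle_marginal(priors, positive, delta=0.1):
    """P(first mapping correct | feedback) for one cycle.
    `priors[i]` is the prior probability that mapping i is correct."""
    numerator = 0.0    # joint mass where the first mapping is correct
    denominator = 0.0  # P(feedback): joint mass over all assignments
    for states in product((True, False), repeat=len(priors)):
        joint = feedback_likelihood(states, positive, delta)
        for prior, ok in zip(priors, states):
            joint *= prior if ok else 1.0 - prior
        denominator += joint
        if states[0]:
            numerator += joint
    return numerator / denominator


# Cycle m0, m3, m4 with prior 0.7 on each mapping.
after_positive = cycle_marginal([0.7, 0.7, 0.7], positive=True)
after_negative = cycle_marginal([0.7, 0.7, 0.7], positive=False)
```

With these assumed values a positive feedback raises the belief in m0 above its prior and a negative feedback lowers it. The enumeration is exponential in the cycle length and treats the cycle in isolation, which is exactly the limitation raised in the last bullet.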
13
A Brief Intro to Factor-Graphs
• $g(x_1, x_2, x_3, x_4) = f_A(x_1, x_2)\, f_B(x_2, x_3, x_4)$
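To see why the factorization matters, the sketch below computes the marginal of x2 twice: once by summing g over all other variables, and once by summing each factor separately and multiplying the two results, which is what sum-product does on this small tree. The factor tables are arbitrary positive values chosen for illustration; they do not come from the talk.

```python
from itertools import product

BINARY = (0, 1)


def f_a(x1, x2):
    # Arbitrary illustrative factor: favours x1 == x2.
    return 0.9 if x1 == x2 else 0.1


def f_b(x2, x3, x4):
    # Arbitrary illustrative factor with strictly positive values.
    return 0.5 + 0.25 * x2 + 0.125 * x3 * x4


def marginal_brute_force(x2):
    """Sum g(x1, x2, x3, x4) = f_a * f_b over x1, x3, x4."""
    return sum(
        f_a(x1, x2) * f_b(x2, x3, x4)
        for x1, x3, x4 in product(BINARY, repeat=3)
    )


def marginal_sum_product(x2):
    """Same marginal, using the factorization: each factor sends x2 a
    message obtained by summing out its other variables."""
    message_from_a = sum(f_a(x1, x2) for x1 in BINARY)
    message_from_b = sum(
        f_b(x2, x3, x4) for x3, x4 in product(BINARY, repeat=2)
    )
    return message_from_a * message_from_b
```

Both functions return the same unnormalized marginal; the second one never builds the full joint, which is the property the PDMS factor graphs exploit.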
14
Deriving PDMS Factor-Graphs
Abductive reasoning on transitive closures of mappings
a priori information on mapping
15
PDMS Factor-Graphs
• Cyclic graph
  – Junction Tree? Clustering / Stretching of variables?
    • Centralization
    • Computational + communicational overhead
  – Iterative Sum-Product
    • Approximate results
• How to perform iterative sum-product by message passing on the mapping graph?
  – Message passing in factor graph does not correspond to connectivity of mapping graph
  – We want to rely on decentralized computations only
• Locality VS Globality of nodes in the factor graph
  – Mappings: local
  – Feedback factor: common, global knowledge
  – Observed feedback variables: neighborhood
16
Embedded Message-Passing (1)
17
Embedded Message-Passing (2)
18
Message Passing
• Decentralized computations
• Computationally inexpensive
  – Sums and Products
• Message-Passing Schedules
  – Periodic
  – Lazy (piggybacking on query forwarding)
    • No message overhead
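A minimal sketch of iterative sum-product over mappings and feedback factors follows, to show that each update is indeed only sums and products. Several things here are assumptions and not the implemented system: the computation runs in a single process, whereas in a PDMS the peers would exchange these messages over the network; the schedule is a fixed number of synchronous rounds; and the feedback table (1, 0, delta for zero, one, or several wrong mappings) is assumed as before, with delta > 0 and priors strictly between 0 and 1. All function and variable names are hypothetical.

```python
from itertools import product


def likelihood(states, positive, delta):
    """ASSUMED feedback table: P(f+) = 1 if no mapping is wrong,
    0 if exactly one is wrong, delta if two or more are wrong."""
    wrong = states.count(False)
    p_positive = 1.0 if wrong == 0 else 0.0 if wrong == 1 else delta
    return p_positive if positive else 1.0 - p_positive


def run_sum_product(priors, feedbacks, delta=0.1, rounds=20):
    """priors: {mapping: P(correct)}; feedbacks: [(mappings, positive)].
    Returns {mapping: estimated P(correct | all feedbacks)}."""
    # msg[(i, m)] = (weight correct, weight incorrect) from factor i to m.
    msg = {(i, m): (0.5, 0.5)
           for i, (names, _) in enumerate(feedbacks) for m in names}

    def var_to_factor(m, skip):
        # Prior times all factor messages to m, except from factor `skip`.
        ok, bad = priors[m], 1.0 - priors[m]
        for (i, name), (w_ok, w_bad) in msg.items():
            if name == m and i != skip:
                ok, bad = ok * w_ok, bad * w_bad
        return ok, bad

    for _ in range(rounds):
        new_msg = {}
        for i, (names, positive) in enumerate(feedbacks):
            incoming = {m: var_to_factor(m, i) for m in names}
            for target in names:
                others = [m for m in names if m != target]
                weights = []
                for target_ok in (True, False):
                    total = 0.0
                    for states in product((True, False), repeat=len(others)):
                        w = likelihood((target_ok,) + states, positive, delta)
                        for m, ok in zip(others, states):
                            w *= incoming[m][0 if ok else 1]
                        total += w
                    weights.append(total)
                norm = weights[0] + weights[1]
                new_msg[(i, target)] = (weights[0] / norm, weights[1] / norm)
        msg = new_msg

    posteriors = {}
    for m in priors:
        ok, bad = var_to_factor(m, skip=None)
        posteriors[m] = ok / (ok + bad)
    return posteriors
```

For example, `run_sum_product({m: 0.7 for m in "abcde"}, [(("a", "b", "c"), True), (("c", "d", "e"), False)])` raises the belief in a and lowers the belief in d. On graphs where cycles share mappings in a loopy way the result is only approximate, as noted on the factor-graph slide.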
19
Implemented System
• Schemas
  – Import from OWL (Web Ontology Language)
• Mappings
  – KnowledgeWeb Ontology Alignment API
  – Import from RDF/XML
  – Automated on-the-fly creation
  – Comparison to standard alignments
Automatic derivation of quality measures P(m=correct | {F}) for the mappings using iterative message-passing
Query routing based on the quality measures
Precision / recall tradeoff
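One simple way to turn the quality measures into a routing decision is a threshold: a query is forwarded through a mapping only if its estimated P(m=correct | {F}) is high enough. The slides do not describe the actual routing policy, so the threshold rule, the function name and the sample values below are assumptions made for illustration.

```python
def route_query(mapping_quality, threshold):
    """Return the mappings (sorted by name) through which a query would be
    forwarded: those whose estimated quality reaches `threshold`."""
    return [
        name
        for name, quality in sorted(mapping_quality.items())
        if quality >= threshold
    ]


# Hypothetical quality estimates for the outgoing mappings of one peer.
quality = {"m0": 0.95, "m3": 0.40, "m4": 0.75}

cautious = route_query(quality, threshold=0.7)   # fewer, more reliable hops
flooding = route_query(quality, threshold=0.0)   # forward everywhere
```

Raising the threshold favours precision (fewer results obtained through doubtful mappings), while lowering it towards zero approaches flooding and favours recall.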
20
Some (Preliminary) Results: Convergence
(undirected example graph, prior 0.7 delta 0.1)
21
Fault-tolerance (faulty links)
(undirected example graph, prior 0.8 delta 0.1)
22
Detecting Erroneous Mappings
(random network of 50 schemas and 200 mappings, no prior information)
23
Conclusions
• Deriving quality measures for PDMS mappings
  – Automated process
  – Decentralized computations
  – Based on agreements on conceptualizations
    • Emergent Semantics
• Current work
  – More expressive mappings
    • E.g., subsumption
  – Integration in the GridVine semantic overlay network
  – Application to other domains
    • Web Services composition?
24
Thank you for your attention
Web page: lsirpeople.epfl.ch/cudre
• Questions?