![Page 1: An Effective Rule Miner for Instance Matching in a Web of Data](https://reader035.vdocuments.mx/reader035/viewer/2022062406/55975d8f1a28ab961a8b4725/html5/thumbnails/1.jpg)
An Effective Rule Miner for Instance Matching in a Web of Data
Xing Niu, Shu Rong, Haofen Wang and Yong YuShanghai Jiao Tong University
2012.10.31 @ CIKM2012
![Page 2: An Effective Rule Miner for Instance Matching in a Web of Data](https://reader035.vdocuments.mx/reader035/viewer/2022062406/55975d8f1a28ab961a8b4725/html5/thumbnails/2.jpg)
Agenda
Introduction
Workflow
Experiments
Summary
Page 2
![Page 3: An Effective Rule Miner for Instance Matching in a Web of Data](https://reader035.vdocuments.mx/reader035/viewer/2022062406/55975d8f1a28ab961a8b4725/html5/thumbnails/3.jpg)
Introduction - the Linking Open Data Project
Page 3
![Page 4: An Effective Rule Miner for Instance Matching in a Web of Data](https://reader035.vdocuments.mx/reader035/viewer/2022062406/55975d8f1a28ab961a8b4725/html5/thumbnails/4.jpg)
Introduction - Equivalent Instances
DBpedia
GeoSpecies
Page 4*photo by Brenda Zaun (form Wikipedia)
dbpedia:Nene_(bird)
foaf:name “Nene”
dbpprop:binomial “Branta sandvicensis”
dbpedia-owl:phylum dbpedia:Chordate
…
gs:nene
gs:hasCommonName “Nene”
gs:hasCanonicalName “Branta sandvicensis”
gs:inPhylum gs:Chordate
…
![Page 5: An Effective Rule Miner for Instance Matching in a Web of Data](https://reader035.vdocuments.mx/reader035/viewer/2022062406/55975d8f1a28ab961a8b4725/html5/thumbnails/5.jpg)
Introduction – Problem Definition
Domain/dataset-specific methods General-purpose approaches
Requiring manually defined matching rules (i.e. link specifications)
Focusing on similarity metrics
Page 5
DBpediaDBpedia NewNew
![Page 6: An Effective Rule Miner for Instance Matching in a Web of Data](https://reader035.vdocuments.mx/reader035/viewer/2022062406/55975d8f1a28ab961a8b4725/html5/thumbnails/6.jpg)
Introduction – Our Solution
Automatically discovering and refining dataset-specific matching rules in iterations Deriving these rules by finding the most discriminative data
characteristics for a given data source pair.
Page 6
![Page 7: An Effective Rule Miner for Instance Matching in a Web of Data](https://reader035.vdocuments.mx/reader035/viewer/2022062406/55975d8f1a28ab961a8b4725/html5/thumbnails/7.jpg)
Workflow – Pre-processing
For each pair of existing matched instances, their property-value pairs are merged.
E.g. dbpedia:Nene_(bird) = gs:nene foaf:name:“Nene” dbpprop:binomial:“Branta sandvicensis” dbpedia-owl:phylum:dbpedia:Chordate gs:hasCommonName:“Nene” gs:hasCanonicalName:“Branta sandvicensis” gs:inPhylum:gs:Chordate
Page 7
![Page 8: An Effective Rule Miner for Instance Matching in a Web of Data](https://reader035.vdocuments.mx/reader035/viewer/2022062406/55975d8f1a28ab961a8b4725/html5/thumbnails/8.jpg)
Workflow – Mining Properties Equivalences
Statistical Schema Induction [Völker et.al., ESWC2011] E.g. dbpedia:Nene_(bird) = gs:nene
Page 8
Values Property_1 Property_2
“Nene” foaf:name gs:hasCommonName
“Branta sandvicensis” dbpprop:binomial gs:hasCanonicalName
“Panda” foaf:name gs:hasCommonName
“Coconut” foaf:name gs:hasCommonName
… … …
![Page 9: An Effective Rule Miner for Instance Matching in a Web of Data](https://reader035.vdocuments.mx/reader035/viewer/2022062406/55975d8f1a28ab961a8b4725/html5/thumbnails/9.jpg)
Workflow – Mining Matching Rules
Association rule mining, agian. E.g. for the transaction dbpedia:Nene_(bird) = gs:nene, it
has items: valueOf(foaf:name) = valueOf(gs:hasCommonName) valueOf(dbpprop:binomial) = valueOf(gs:hasCanonicalName) valueOf(dbpedia-owl:phylum) = valueOf(gs:inPhylum) …
Page 9
![Page 10: An Effective Rule Miner for Instance Matching in a Web of Data](https://reader035.vdocuments.mx/reader035/viewer/2022062406/55975d8f1a28ab961a8b4725/html5/thumbnails/10.jpg)
Workflow – Mining Matching Rules (con’t)
Matching rule: dbpedia:x and gs:x are matched, iff. valueOf(foaf:name) = valueOf(gs:hasCommonName) and valueOf(dbpprop:binomial) = valueOf(gs:hasCanonicalName) and valueOf(dbpedia-owl:phylum) = valueOf(gs:inPhylum)
Each matching rule brings a confidence value with it. We will introduce it later.
Page 10
![Page 11: An Effective Rule Miner for Instance Matching in a Web of Data](https://reader035.vdocuments.mx/reader035/viewer/2022062406/55975d8f1a28ab961a8b4725/html5/thumbnails/11.jpg)
Workflow – Generating Matches
Applying the obtained rule(s) on the unlabeled data to generate matches’ candidates. Each candidate also has a confidence value which equals to
the confidence of its corresponding rule.
The combiner is used to combine confidence values of a match’s candidate. Dempster’s rule [G. Shafer, 1976]
Page 11
![Page 12: An Effective Rule Miner for Instance Matching in a Web of Data](https://reader035.vdocuments.mx/reader035/viewer/2022062406/55975d8f1a28ab961a8b4725/html5/thumbnails/12.jpg)
Workflow – the Wrapper Algorithm
The wrapper is an implementation of Expectation-Maximization iterations. Refining matching rules, discovering new matches
Page 12
ExpectationMaximization
![Page 13: An Effective Rule Miner for Instance Matching in a Web of Data](https://reader035.vdocuments.mx/reader035/viewer/2022062406/55975d8f1a28ab961a8b4725/html5/thumbnails/13.jpg)
Workflow – E-step
The E-step estimates the missing data using the observed data and the current estimate for the parameters.
The E-step estimates the matches using the current estimate for the matching rules.
Page 13
Expectation
![Page 14: An Effective Rule Miner for Instance Matching in a Web of Data](https://reader035.vdocuments.mx/reader035/viewer/2022062406/55975d8f1a28ab961a8b4725/html5/thumbnails/14.jpg)
Workflow – M-step
The M-step computes parameters maximizing the likelihood function as the data estimated in E-step are used in lieu of the actual missing data. M: matches Θ: parameters
Page 14
Maximization
![Page 15: An Effective Rule Miner for Instance Matching in a Web of Data](https://reader035.vdocuments.mx/reader035/viewer/2022062406/55975d8f1a28ab961a8b4725/html5/thumbnails/15.jpg)
Workflow – The Likelihood Function
For a set of given matches, the proximity is reflected in two aspects: correctness (precision) completeness (recall)
Without complete reference matches (in real-world data), it is difficult to precisely evaluate either of them.
An alternative measurement: optimizing the precision takes priority and obtaining all potential matches on the premise of that precision value.
The likelihood function can be continued as:
Page 15
![Page 16: An Effective Rule Miner for Instance Matching in a Web of Data](https://reader035.vdocuments.mx/reader035/viewer/2022062406/55975d8f1a28ab961a8b4725/html5/thumbnails/16.jpg)
Workflow - Approximate Precision
Assuming that no equivalent instances exist in a single data source, we can infer that an instance is equivalent to at most one another from the other data source.
Incorrect matches in M may result in a node connecting to more than one other node, which is contrary to the assumption.
Page 16
P=4/4=1 P=2/4=0.5
![Page 17: An Effective Rule Miner for Instance Matching in a Web of Data](https://reader035.vdocuments.mx/reader035/viewer/2022062406/55975d8f1a28ab961a8b4725/html5/thumbnails/17.jpg)
Experiments - Datasets
DBpedia is a hub data source in LOD. It structures Wikipedia knowledge and make this structured information available on the Web.
GeoNames, LinkedMDB and GeoSpecies
Task: discovering matches between DBpedia and the other three domain-specific data sources.
Statistics
Page 17
Datasets Instances References Baselines
DBpedia 4,071,600 - -
GeoNames 8,147,136 317,433 manually
LinkedMDB (Film) 97,471 16,447 ODDLinker
GeoSpecies 20,939 11,490 unknown
![Page 18: An Effective Rule Miner for Instance Matching in a Web of Data](https://reader035.vdocuments.mx/reader035/viewer/2022062406/55975d8f1a28ab961a8b4725/html5/thumbnails/18.jpg)
Experiments – Precisions
Sampling a certain number of output matches. The X-axis indicates the proportions of selected seeds in
complete reference matches.
Page 18
![Page 19: An Effective Rule Miner for Instance Matching in a Web of Data](https://reader035.vdocuments.mx/reader035/viewer/2022062406/55975d8f1a28ab961a8b4725/html5/thumbnails/19.jpg)
Experiments – Newly Found Matches
The match space constituted by reference matches and newly found matches
Page 19
![Page 20: An Effective Rule Miner for Instance Matching in a Web of Data](https://reader035.vdocuments.mx/reader035/viewer/2022062406/55975d8f1a28ab961a8b4725/html5/thumbnails/20.jpg)
Experiments – Running Time
Java@Map/Reduce The Hadoop cluster contains 40 nodes. Each node is a PC (Intel Core 2 Quad 2.66GHz CPU, 2GB RAM)
can run 3 Maps + 3 Reduces simultaneously. This is a shared cluster and we occupy 50 slots.
We sample some typical running times for a single iteration.
Page 20
![Page 21: An Effective Rule Miner for Instance Matching in a Web of Data](https://reader035.vdocuments.mx/reader035/viewer/2022062406/55975d8f1a28ab961a8b4725/html5/thumbnails/21.jpg)
Experiments – Running Time (con’t)
The most time-consuming phases are “data pre-processing” and “getting initial matches”.
Page 21
![Page 22: An Effective Rule Miner for Instance Matching in a Web of Data](https://reader035.vdocuments.mx/reader035/viewer/2022062406/55975d8f1a28ab961a8b4725/html5/thumbnails/22.jpg)
Summary
Contributions We proposed a general-purpose approach to automatically
mine dataset-specific matching rules based on the EM algorithm.
We introduced a graph-based metric to estimate likelihood (precision) and Dempster’s rule to combine confidence values.
*We discussed some extensions to our approach in order to fit the different requirements of various practical applications.
Conclusions We carried out experiments on several real-world datasets.
The results demonstrated the correctness of matches discovered by our approach (precision >0.96 in most cases).
We also shown more matches are found than existing references.
The whole process can be implemented parallel.
Page 22
![Page 23: An Effective Rule Miner for Instance Matching in a Web of Data](https://reader035.vdocuments.mx/reader035/viewer/2022062406/55975d8f1a28ab961a8b4725/html5/thumbnails/23.jpg)