![Page 1: Adopting Ontologies for Multisource Identity Resolution](https://reader036.vdocuments.mx/reader036/viewer/2022062410/568156f7550346895dc4a0d0/html5/thumbnails/1.jpg)
Adopting Ontologies for Multisource Identity Resolution
Milena Yankova, Horacio Saggion, Hamish Cunningham
Department of Computer Science, The University of Sheffield
![Page 2: Adopting Ontologies for Multisource Identity Resolution](https://reader036.vdocuments.mx/reader036/viewer/2022062410/568156f7550346895dc4a0d0/html5/thumbnails/2.jpg)
Overview
• Introduction
• Knowledge representation
• Usage of ontologies in identity resolution
• Case-study & Evaluation
• Conclusion and Further Work
![Page 3: Adopting Ontologies for Multisource Identity Resolution](https://reader036.vdocuments.mx/reader036/viewer/2022062410/568156f7550346895dc4a0d0/html5/thumbnails/3.jpg)
Introduction
• Identity resolution aims at identifying newly presented facts and linking them to their previous mentions.
• Our main hypothesis is that:
  – variations of one and the same fact can be recognised,
  – duplications can be removed, and
  – their aggregation actually increases the correctness of fact extraction.
• We use an ontology as the internal and resulting knowledge representation formalism.
• It contains not only the representation of the domain, but also known entities and properties.
![Page 4: Adopting Ontologies for Multisource Identity Resolution](https://reader036.vdocuments.mx/reader036/viewer/2022062410/568156f7550346895dc4a0d0/html5/thumbnails/4.jpg)
Knowledge Representation via Ontologies
• Ontologies have been chosen because of their detailed entity descriptions, which are complemented with semantic information.
• The expected benefit of using a semantic representation is the ability to recognise not only the type/class of the objects, but also the individual instances they refer to.
  – For example, different appearances of "M&S" in different sources (e.g. web pages) are extracted and collected as a single instance to which all mentions point.
• The semantic linkup of the identified objects guarantees a more detailed description than a simple syntactic representation.
• In this way it provides more details which, serving as evidence, can improve the accuracy of object comparison.
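The single-instance idea above can be sketched as a small data structure. This is a minimal illustration, not the framework's actual model: the `Mention` and `Instance` classes, the `ex:` identifier and the example.org URLs are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Mention:
    text: str    # surface form as it appeared in the source
    source: str  # where it was found (hypothetical URL)


@dataclass
class Instance:
    uri: str     # hypothetical instance identifier
    cls: str     # ontology class of the instance
    mentions: List[Mention] = field(default_factory=list)


# Two appearances of "M&S" on different pages point to one instance.
ms = Instance(uri="ex:Company_1", cls="musing:Company")
for text, source in [
    ("M&S", "http://example.org/page1"),
    ("M&S", "http://example.org/page2"),
]:
    ms.mentions.append(Mention(text, source))
```

The instance keeps the class information and every mention, so later comparisons can draw on all collected evidence rather than on a single string.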
![Page 5: Adopting Ontologies for Multisource Identity Resolution](https://reader036.vdocuments.mx/reader036/viewer/2022062410/568156f7550346895dc4a0d0/html5/thumbnails/5.jpg)
Source of information
• In this application we have two sources of information (company profiles):
  – a database of manually collected company details
  – profiles extracted from web pages
![Page 6: Adopting Ontologies for Multisource Identity Resolution](https://reader036.vdocuments.mx/reader036/viewer/2022062410/568156f7550346895dc4a0d0/html5/thumbnails/6.jpg)
Mapping Databases to Ontologies
• The database schema is the data description that holds the meaning of the data.
• Binding databases to other knowledge representation formalisms, e.g. ontologies, requires deep understanding and domain expertise.
• It is usually done manually, by producing a mapping between the particular database schema and the given ontology.
• We use company profiles stored in a MySQL relational database management system, which has been manually mapped to the Musing ontology using scripts.
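A manual schema-to-ontology mapping of this kind might look like the sketch below. It is only an illustration under stated assumptions: the table layout, the column names, and the `ex:` property names are hypothetical, and Python's built-in `sqlite3` stands in for the MySQL database so that the example runs on its own.

```python
import sqlite3

# Hypothetical, hand-written mapping from database columns to ontology properties.
COLUMN_TO_PROPERTY = {
    "name": "ex:hasName",
    "postcode": "ex:hasPostCode",
}


def rows_to_triples(conn):
    """Turn each row of the (assumed) company table into ontology triples."""
    conn.row_factory = sqlite3.Row
    triples = []
    for row in conn.execute("SELECT id, name, postcode FROM company"):
        subject = f"ex:company/{row['id']}"
        triples.append((subject, "rdf:type", "musing:Company"))
        for column, prop in COLUMN_TO_PROPERTY.items():
            if row[column] is not None:  # skip missing values
                triples.append((subject, prop, row[column]))
    return triples


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE company (id INTEGER PRIMARY KEY, name TEXT, postcode TEXT)")
conn.execute("INSERT INTO company (name, postcode) VALUES (?, ?)", ("MARKS & SPENCER", None))
triples = rows_to_triples(conn)
```

The mapping dictionary is the part that needs domain expertise; the loop that applies it is mechanical.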
![Page 7: Adopting Ontologies for Multisource Identity Resolution](https://reader036.vdocuments.mx/reader036/viewer/2022062410/568156f7550346895dc4a0d0/html5/thumbnails/7.jpg)
Information Extraction
![Page 8: Adopting Ontologies for Multisource Identity Resolution](https://reader036.vdocuments.mx/reader036/viewer/2022062410/568156f7550346895dc4a0d0/html5/thumbnails/8.jpg)
Ontology-based Information Extraction
• Ontology-based information extraction aims at identifying, in text, concepts and instances from an underlying domain model specified in an ontology.
• The extraction prototype uses some default linguistic processors from GATE.
• Custom application rules for concept identification are specified in regular grammars implemented in the JAPE language.
![Page 9: Adopting Ontologies for Multisource Identity Resolution](https://reader036.vdocuments.mx/reader036/viewer/2022062410/568156f7550346895dc4a0d0/html5/thumbnails/9.jpg)
Ontologies in IDRF
• Our approach to the identity problem has been implemented as the Identity Resolution Framework (IDRF).
• It uses an ontology as the internal and resulting knowledge representation formalism.
• It is based on the PROTON ontology, which can be extended, e.g. for our particular domain of company profiling.
![Page 10: Adopting Ontologies for Multisource Identity Resolution](https://reader036.vdocuments.mx/reader036/viewer/2022062410/568156f7550346895dc4a0d0/html5/thumbnails/10.jpg)
Identity Class Models
• Execution of the IDRF is based on what we call Class Models, which handle the differences between the entity types represented as ontology classes.
• Each class model is expressed by a single formula based on first-order probabilistic logic.
• Each formula is manually composed by combining predicates with the usual logical connectives such as "&", "|", "not" and "⇒".
• Class models are used in two stages of the framework pipeline:
  – during the retrieval of potential matching candidates from the ontology, applying strict criteria;
  – during the actual comparison of potential matching pairs of entities, using soft criteria.
• They are also evaluated differently depending on which component uses them.
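One way to picture a class model is as a small formula tree of predicates joined by connectives. The sketch below is an assumption made for illustration: the `Pred` and `Op` types, the predicate names and the attributes are hypothetical and are not taken from the framework.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Pred:
    name: str       # hypothetical predicate name
    attribute: str  # entity attribute the predicate compares


@dataclass(frozen=True)
class Op:
    connective: str               # "&", "|", "not" or "=>"
    args: Tuple["Formula", ...]   # sub-formulas joined by the connective


Formula = Union[Pred, Op]


def render(f: Formula) -> str:
    """Print a formula in infix notation."""
    if isinstance(f, Pred):
        return f"{f.name}({f.attribute})"
    if f.connective == "not":
        return f"not {render(f.args[0])}"
    return "(" + f" {f.connective} ".join(render(a) for a in f.args) + ")"


# A made-up class model for companies: same name AND (same postcode OR same website).
company_model = Op("&", (
    Pred("SameName", "name"),
    Op("|", (Pred("SamePostCode", "postcode"), Pred("SameWebsite", "website"))),
))
print(render(company_model))
# prints: (SameName(name) & (SamePostCode(postcode) | SameWebsite(website)))
```

Keeping the formula as data, rather than as hard-coded logic, is what would let two components interpret the same model differently: one as a strict retrieval filter, the other as a soft similarity score.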
![Page 11: Adopting Ontologies for Multisource Identity Resolution](https://reader036.vdocuments.mx/reader036/viewer/2022062410/568156f7550346895dc4a0d0/html5/thumbnails/11.jpg)
Example of Class Model definition
![Page 12: Adopting Ontologies for Multisource Identity Resolution](https://reader036.vdocuments.mx/reader036/viewer/2022062410/568156f7550346895dc4a0d0/html5/thumbnails/12.jpg)
Pre-filtering
• It restricts the full set of ontology instances to a reasonable number of candidates, to which the source entity will be compared.
• In this case the engine does not formally evaluate the class model/formula but composes a SeRQL or SQL query.
• The query embodies the model's strong equivalence criteria.
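Composing a query from the strict criteria could look roughly like this. It is a sketch under assumptions, not the framework's query builder: the table, the columns and the placeholder postcodes are invented, only the SQL variant is shown, and `sqlite3` is used so that the example is self-contained.

```python
import sqlite3


def build_prefilter_query(table, strict_criteria):
    """Compose a parameterised SQL query from strict equality criteria.

    Table and column names are assumed to come from the class model, not from
    user input, so they are interpolated; values are passed as parameters.
    """
    where = " AND ".join(f"{column} = ?" for column in strict_criteria)
    sql = f"SELECT id, name, postcode FROM {table} WHERE {where}"
    return sql, list(strict_criteria.values())


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE company (id INTEGER PRIMARY KEY, name TEXT, postcode TEXT)")
conn.executemany(
    "INSERT INTO company (name, postcode) VALUES (?, ?)",
    [  # made-up rows with placeholder postcodes
        ("MARKS & SPENCER", "X1 1XX"),
        ("MARKS & SPENCER", "Y2 2YY"),
        ("EXAMPLE LTD", "Z3 3ZZ"),
    ],
)

sql, params = build_prefilter_query("company", {"name": "MARKS & SPENCER"})
candidates = conn.execute(sql, params).fetchall()
```

Only the rows returned here would be passed on to the more expensive pairwise comparison.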
![Page 13: Adopting Ontologies for Multisource Identity Resolution](https://reader036.vdocuments.mx/reader036/viewer/2022062410/568156f7550346895dc4a0d0/html5/thumbnails/13.jpg)
Example for Pre-filtering Query
• "MARKS & SPENCER" query according to the class model for "musing:Company"
![Page 14: Adopting Ontologies for Multisource Identity Resolution](https://reader036.vdocuments.mx/reader036/viewer/2022062410/568156f7550346895dc4a0d0/html5/thumbnails/14.jpg)
Evidence Collection (1)
• This component calculates the similarity between two objects based on their class model.
• The similarity is expressed by a probabilistic logic formula that yields a real number from 0 to 1:
  – "0" means that the given entities are totally different;
  – "1" means that they are absolutely equivalent;
  – any value between 0 and 1 is the probability that these entities are equivalent.
![Page 15: Adopting Ontologies for Multisource Identity Resolution](https://reader036.vdocuments.mx/reader036/viewer/2022062410/568156f7550346895dc4a0d0/html5/thumbnails/15.jpg)
Evidence Collection (2)
• The value for each of the predicates in the formula is calculated according to the algorithm it represents.
  – Predicate values are combined according to the logical connectives in the formula.
  – In this setting the usual logical connectives are expressed as arithmetic expressions, e.g. a ∨ b = a + b - ab.
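The arithmetic reading of the connectives can be sketched as follows. Only the disjunction rule a + b - ab comes from the slides; the rules for negation, conjunction and implication are the usual companions under an independence assumption and are assumptions here, as are the predicate values in the example.

```python
def p_not(a):
    return 1.0 - a


def p_and(a, b):
    # Assumed: product rule for independent evidence.
    return a * b


def p_or(a, b):
    # From the slides: a OR b = a + b - ab.
    return a + b - a * b


def p_implies(a, b):
    # Assumed: a => b treated as (not a) OR b.
    return p_or(p_not(a), b)


# Hypothetical predicate values for one candidate pair.
same_name, same_postcode, same_website = 0.9, 0.0, 0.8

# same_name AND (same_postcode OR same_website)
score = p_and(same_name, p_or(same_postcode, same_website))
```

With these definitions the disjunction agrees with De Morgan's law, since 1 - (1 - a)(1 - b) = a + b - ab, so the assumed conjunction and negation are at least consistent with the rule given in the slides.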
![Page 16: Adopting Ontologies for Multisource Identity Resolution](https://reader036.vdocuments.mx/reader036/viewer/2022062410/568156f7550346895dc4a0d0/html5/thumbnails/16.jpg)
Data Integration
• It is the third stage of the identification process.
• It encodes the strength of the presented evidence for choosing the candidate favoured by the Class Model.
• The successful candidate must pass a threshold which balances the precision and recall of the application.
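The selection step can be illustrated with a few lines of code. This is a sketch, not the framework's implementation: the function name, the candidate identifiers and the scores are hypothetical, and treating "pass" as "greater than or equal to" is an assumption.

```python
def best_match(scores, threshold):
    """Return the id of the highest-scoring candidate, or None if it fails the threshold.

    `scores` maps candidate ids to similarity values in [0, 1].
    """
    if not scores:
        return None
    candidate, score = max(scores.items(), key=lambda kv: kv[1])
    return candidate if score >= threshold else None


# Hypothetical similarity scores for three candidates.
scores = {"ex:company/1": 0.72, "ex:company/2": 0.35, "ex:company/3": 0.10}
match = best_match(scores, threshold=0.4)
```

Raising the threshold makes matches rarer but more reliable (higher precision), while lowering it accepts more matches at the risk of wrong links (higher recall).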
![Page 17: Adopting Ontologies for Multisource Identity Resolution](https://reader036.vdocuments.mx/reader036/viewer/2022062410/568156f7550346895dc4a0d0/html5/thumbnails/17.jpg)
Decision Threshold
• A pre-set threshold determines whether to register a match as successful.
• We have used ROC curve analysis to set the threshold to 0.4, which gives the best performance in our application.
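A threshold can be read off a ROC curve in several ways, and the slides do not say which criterion was used. The sketch below assumes one common choice, maximising the true positive rate minus the false positive rate (Youden's J), and runs it on made-up scores and labels that are unrelated to the reported experiment.

```python
def roc_points(scores, labels):
    """Return (threshold, true positive rate, false positive rate) per distinct score."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")
    points = []
    for t in sorted(set(scores)):
        # A pair is predicted as a match when its score reaches the threshold.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        points.append((t, tp / positives, fp / negatives))
    return points


def pick_threshold(scores, labels):
    """Assumed criterion: maximise TPR - FPR (Youden's J)."""
    return max(roc_points(scores, labels), key=lambda p: p[1] - p[2])[0]


# Toy data: similarity scores and whether each pair was a true match.
toy_scores = [0.1, 0.2, 0.3, 0.5, 0.7, 0.9]
toy_labels = [0, 0, 0, 1, 1, 1]
threshold = pick_threshold(toy_scores, toy_labels)
```

On real data the classes overlap, so the chosen point is a compromise rather than a perfect separator, and a different criterion would generally give a different threshold.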
![Page 18: Adopting Ontologies for Multisource Identity Resolution](https://reader036.vdocuments.mx/reader036/viewer/2022062410/568156f7550346895dc4a0d0/html5/thumbnails/18.jpg)
Case-study
• Our case-study is focused on company profiling.
• We have automatically extracted hundreds of company profiles from different web sites, e.g. http://uk.finance.yahoo.com
• Our database is populated with about 1.8M manually collected company profiles provided by http://www.marketlocation.com
• The evaluation has targeted a set of 310 extracted UK companies compared to the database.
![Page 19: Adopting Ontologies for Multisource Identity Resolution](https://reader036.vdocuments.mx/reader036/viewer/2022062410/568156f7550346895dc4a0d0/html5/thumbnails/19.jpg)
Evaluation of the IDRF
• The accuracy of identity resolution is very promising (89% F-measure).
• Another experiment on automatically extracted vacancies shows similar results.
![Page 20: Adopting Ontologies for Multisource Identity Resolution](https://reader036.vdocuments.mx/reader036/viewer/2022062410/568156f7550346895dc4a0d0/html5/thumbnails/20.jpg)
Evaluation of the IE
• The recall of automatically extracted company attributes improves from 92% to 97% after integration.
• The precision rises slightly, from 70% to 73%.
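For readers who want to relate precision and recall to a single figure, the standard F-measure can be computed as below. The slides report an F-measure only for identity resolution, not for the extraction results, so any value derived here from the precision and recall figures is an illustration rather than a reported result; using the balanced variant (beta = 1) is also an assumption.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall of attribute extraction as stated on the slide.
before = f_measure(0.70, 0.92)
after = f_measure(0.73, 0.97)
improved = after > before
```

Because both precision and recall go up after integration, the combined measure goes up as well for any choice of beta.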
![Page 21: Adopting Ontologies for Multisource Identity Resolution](https://reader036.vdocuments.mx/reader036/viewer/2022062410/568156f7550346895dc4a0d0/html5/thumbnails/21.jpg)
Conclusion and future work
• IDRF is a general framework for identity resolution which is based on ontologies.
• It is adapted to ontology-based information extraction applications.
• Future work: how the uniqueness of the details and their number influence the process of identification.