![Page 1: Identifying Relevant Sources for Data Linking using a Semantic Web Index](https://reader033.vdocuments.mx/reader033/viewer/2022061104/53ff99168d7f724c088b46b7/html5/thumbnails/1.jpg)
Identifying Relevant Sources for Data Linking using a Semantic Web Index
Andriy Nikolov
Mathieu d’Aquin
Knowledge Media Institute
The Open University, UK
![Page 2: Identifying Relevant Sources for Data Linking using a Semantic Web Index](https://reader033.vdocuments.mx/reader033/viewer/2022061104/53ff99168d7f724c088b46b7/html5/thumbnails/2.jpg)
How to link a new dataset?
• What other repositories contain relevant data which I should link to?
– Select the external repository
• How to select the relevant data instances to link?
– Select the relevant classes within the chosen repository
[Diagram labels: TV programs, movies, pieces of music, actors, composers; LinkedMDB, DBPedia, Freebase, MusicBrainz, bestbuy, ?]
![Page 3: Identifying Relevant Sources for Data Linking using a Semantic Web Index](https://reader033.vdocuments.mx/reader033/viewer/2022061104/53ff99168d7f724c088b46b7/html5/thumbnails/3.jpg)
Selection criteria
• Additional information about local instances
• Popularity
• Degree of overlap
[Diagram labels: data.open.ac.uk, Publication data; DBPedia, DBLP, rae:RKBExplorer]
![Page 4: Identifying Relevant Sources for Data Linking using a Semantic Web Index](https://reader033.vdocuments.mx/reader033/viewer/2022061104/53ff99168d7f724c088b46b7/html5/thumbnails/4.jpg)
Available information
• Additional information about resources
– Schema ontology
– Test examples
• Popularity
– VoiD descriptors
• Linking repositories
– Catalog of repositories (CKAN)
• Degree of overlap
– VoiD descriptors (only topic relevance)
– Relevant info hard to obtain on the client side
![Page 5: Identifying Relevant Sources for Data Linking using a Semantic Web Index](https://reader033.vdocuments.mx/reader033/viewer/2022061104/53ff99168d7f724c088b46b7/html5/thumbnails/5.jpg)
Approach
Search for sources with potentially high degree of overlap
– Use a subset of entity labels from the original dataset as keywords for entity search
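The keyword selection step could be sketched as follows. This is only an illustration under stated assumptions: the slides do not say how the subset of labels is chosen, so uniform random sampling is assumed here, and the function name `sample_keywords` and the example labels are hypothetical.

```python
import random


def sample_keywords(labels, k, seed=0):
    """Pick up to k distinct, non-empty labels to use as search keywords.

    Assumption: uniform random sampling; the original method may differ.
    """
    # Deduplicate and drop blank labels; sort so the result is reproducible.
    unique = sorted({label.strip() for label in labels if label.strip()})
    rng = random.Random(seed)
    return rng.sample(unique, min(k, len(unique)))


# Made-up labels standing in for entity labels of the local dataset.
labels = ["Nature", "Science", "Nature", " ", "The Lancet", "Cell"]
keywords = sample_keywords(labels, k=3)
```

Each selected keyword would then be sent to the semantic index as an entity search query.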
![Page 6: Identifying Relevant Sources for Data Linking using a Semantic Web Index](https://reader033.vdocuments.mx/reader033/viewer/2022061104/53ff99168d7f724c088b46b7/html5/thumbnails/6.jpg)
Approach
Aggregate results
– Group instances occurring in returned result sets by their source repositories
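A minimal sketch of the aggregation step, assuming that each search hit is an instance URI and that the source repository can be approximated by the URI's host name. Both are assumptions made for illustration; the function name `group_by_source` and the example URIs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_by_source(result_sets):
    """Map each source (approximated by URI host) to the instance URIs it returned.

    result_sets maps a keyword to the list of instance URIs found for it.
    """
    by_source = defaultdict(set)
    for hits in result_sets.values():
        for uri in hits:
            by_source[urlparse(uri).netloc].add(uri)
    return dict(by_source)


# Made-up result sets for two keywords.
result_sets = {
    "keyword-1": [
        "http://dbpedia.org/resource/A",
        "http://data.linkedmdb.org/resource/film/1",
    ],
    "keyword-2": [
        "http://dbpedia.org/resource/B",
        "http://dbpedia.org/resource/A",
    ],
}
grouped = group_by_source(result_sets)
```

Using sets means an instance returned for several keywords is counted once per source.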
![Page 7: Identifying Relevant Sources for Data Linking using a Semantic Web Index](https://reader033.vdocuments.mx/reader033/viewer/2022061104/53ff99168d7f724c088b46b7/html5/thumbnails/7.jpg)
Approach
Rank sources
– Sort by number of individuals returned in search results
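The ranking step can be sketched as a sort over per-source instance counts. The tie-breaking rule (alphabetical by source name) is an assumption added to make the order deterministic, and the source names and instance ids are placeholders.

```python
def rank_sources(instances_by_source):
    """Return (source, count) pairs, largest count first.

    Ties are broken alphabetically by source name (an assumption).
    """
    counts = {src: len(insts) for src, insts in instances_by_source.items()}
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Placeholder data: source name -> set of instance ids found in search results.
instances_by_source = {
    "source-a": {"a1", "a2", "a3"},
    "source-b": {"b1"},
    "source-c": {"c1", "c2", "c3"},
}
ranking = rank_sources(instances_by_source)
```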
![Page 8: Identifying Relevant Sources for Data Linking using a Semantic Web Index](https://reader033.vdocuments.mx/reader033/viewer/2022061104/53ff99168d7f724c088b46b7/html5/thumbnails/8.jpg)
Approach
Select “most relevant” class
– In each source, select the class that covers the most instances
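A sketch of the class selection step, assuming each returned instance comes with the list of classes it belongs to. The function name `most_relevant_class`, the input shape, and the example instances are hypothetical.

```python
from collections import Counter


def most_relevant_class(instance_types):
    """Return (class, coverage) for the class covering the most instances.

    instance_types maps an instance id to the classes it belongs to.
    Coverage is the fraction of instances that belong to the class.
    Returns None if no instance has any class.
    """
    coverage = Counter()
    for classes in instance_types.values():
        coverage.update(set(classes))  # count each class once per instance
    if not coverage:
        return None
    # Highest count wins; ties are broken alphabetically (an assumption).
    best, count = min(coverage.items(), key=lambda item: (-item[1], item[0]))
    return best, count / len(instance_types)


# Made-up instances from a single source.
instance_types = {
    "i1": ["dbpedia:Person", "dbpedia:MusicArtist"],
    "i2": ["dbpedia:Person"],
    "i3": ["dbpedia:Person", "dbpedia:MusicArtist"],
}
selected = most_relevant_class(instance_types)
```

In this example the broader class wins because it covers every instance, which shows that plain coverage tends to favour generic classes.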
![Page 9: Identifying Relevant Sources for Data Linking using a Semantic Web Index](https://reader033.vdocuments.mx/reader033/viewer/2022061104/53ff99168d7f724c088b46b7/html5/thumbnails/9.jpg)
Issues: imprecise results
• Main cause: ambiguous instance labels
• Inclusion of irrelevant sources
– E.g., DBLP for movie score composers
• Selection of inappropriate classes within the selected source
– Too generic: e.g., dbpedia:Person vs dbpedia:MusicArtist
– Irrelevant: e.g., akt:Publication-Reference (journal volume) vs akt:Journal
![Page 10: Identifying Relevant Sources for Data Linking using a Semantic Web Index](https://reader033.vdocuments.mx/reader033/viewer/2022061104/53ff99168d7f724c088b46b7/html5/thumbnails/10.jpg)
Filtering results
Determine potentially irrelevant classes
– Use state-of-the-art schema matching to select relevant classes
![Page 11: Identifying Relevant Sources for Data Linking using a Semantic Web Index](https://reader033.vdocuments.mx/reader033/viewer/2022061104/53ff99168d7f724c088b46b7/html5/thumbnails/11.jpg)
Filtering results
Filter out irrelevant search results
– Only consider search result instances belonging to “approved” classes
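The filtering step could look like the sketch below. It assumes the set of “approved” classes has already been produced by a schema matcher and that each hit lists its classes; the dictionary layout, the function name `filter_by_approved_classes`, and the `ex:` identifiers are hypothetical.

```python
def filter_by_approved_classes(hits, approved):
    """Keep only hits that belong to at least one approved class."""
    approved = set(approved)
    return [hit for hit in hits if approved & set(hit["classes"])]


# Made-up search hits; the class names reuse the akt: examples from the slides.
hits = [
    {"uri": "ex:1", "classes": ["akt:Journal"]},
    {"uri": "ex:2", "classes": ["akt:Publication-Reference"]},
    {"uri": "ex:3", "classes": ["akt:Journal", "akt:Publication-Reference"]},
]
kept = filter_by_approved_classes(hits, {"akt:Journal"})
```

Sources would then be re-ranked using only the hits that survive the filter.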
![Page 12: Identifying Relevant Sources for Data Linking using a Semantic Web Index](https://reader033.vdocuments.mx/reader033/viewer/2022061104/53ff99168d7f724c088b46b7/html5/thumbnails/12.jpg)
Preliminary experiments
• Datasets
– ORO journals (data.open.ac.uk): 3110 instances
– LinkedMDB films: 400 instances
– LinkedMDB music contributors: 400 instances
• External components
– Semantic index: Sig.ma
– Ontology matching techniques: CIDER, instance-based schema mappings retrieved from BTC2009 dataset
![Page 13: Identifying Relevant Sources for Data Linking using a Semantic Web Index](https://reader033.vdocuments.mx/reader033/viewer/2022061104/53ff99168d7f724c088b46b7/html5/thumbnails/13.jpg)
Preliminary experiments
| Before filtering | +/- | After filtering | +/- |
|---|---|---|---|
| rae2001 (RKB) | + | rae2001 (RKB) | + |
| dotac (RKB) | + | DBPedia | + |
| DBPedia | + | dblp.l3s.de | + |
| oai (RKB) | + | Freebase | + |
| dblp.l3s.de | + | DBLP (RKB) | + |
| wordnet (RKB) | - | eprints (RKB) | + |
| bibsonomy | - | | |
| eprints (RKB) | + | | |
| Freebase | + | | |
| www.examiner.com | - | | |

• Performance measure:
– Proportion of relevant sources among the top-10 returned results
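As a worked illustration of this measure, the sketch below computes it from the +/- judgments as read from the table above. One assumption is labeled in the code: when fewer than 10 sources remain, the proportion is taken over the sources actually returned. The function name `precision_at_k` is hypothetical.

```python
def precision_at_k(judged, k=10):
    """Proportion of relevant sources among the top-k returned results.

    judged is a ranked list of (source, is_relevant) pairs.
    Assumption: if fewer than k results exist, divide by the number returned.
    """
    top = judged[:k]
    if not top:
        return 0.0
    return sum(1 for _, relevant in top if relevant) / len(top)


# Judgments transcribed from the table ("+" -> True, "-" -> False).
before = [
    ("rae2001 (RKB)", True), ("dotac (RKB)", True), ("DBPedia", True),
    ("oai (RKB)", True), ("dblp.l3s.de", True), ("wordnet (RKB)", False),
    ("bibsonomy", False), ("eprints (RKB)", True), ("Freebase", True),
    ("www.examiner.com", False),
]
after = [
    ("rae2001 (RKB)", True), ("DBPedia", True), ("dblp.l3s.de", True),
    ("Freebase", True), ("DBLP (RKB)", True), ("eprints (RKB)", True),
]

p_before = precision_at_k(before)  # 7 relevant out of 10
p_after = precision_at_k(after)    # 6 relevant out of 6
```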
![Page 14: Identifying Relevant Sources for Data Linking using a Semantic Web Index](https://reader033.vdocuments.mx/reader033/viewer/2022061104/53ff99168d7f724c088b46b7/html5/thumbnails/14.jpg)
Preliminary experiments
• Summary:
– Top-ranked returned repositories are largely relevant from the point of view of linking
– Filtering using schema matching techniques greatly improves precision (all remaining sources are relevant)
– … but at the expense of some recall
![Page 15: Identifying Relevant Sources for Data Linking using a Semantic Web Index](https://reader033.vdocuments.mx/reader033/viewer/2022061104/53ff99168d7f724c088b46b7/html5/thumbnails/15.jpg)
Future work
• Improving the quality of results
– E.g., estimating the potential loss of precision/recall for different filtering decisions
• Integrating with the data linking workflow
– Automatically pre-configuring the data linking algorithm
• Repository search as a potentially useful semantic search use case (in addition to entity and document search)
![Page 16: Identifying Relevant Sources for Data Linking using a Semantic Web Index](https://reader033.vdocuments.mx/reader033/viewer/2022061104/53ff99168d7f724c088b46b7/html5/thumbnails/16.jpg)
Questions?
Thanks for your attention