
Using Substitutive Itemset Mining Framework for Finding Synonymous Properties in Linked Data

Mikolaj Morzy, Agnieszka Lawrynowicz, Mateusz Zozulinski

Poznan University of Technology, Poland

August 3rd, 2015
RuleML 2015


Outline

Motivating Scenario

Substitutive Sets Mining

Finding Synonymous Properties with Substitutive Sets Mining

Summary


Motivating scenario 1/3

Wiki (mappings: Wikipedia infoboxes -> DBpedia ontology)

Subject        Property              Object
Norah_Jones    dbpedia-prop:origin   Denton,_Texas
Mark_Knopfler  dbpedia-owl:hometown  Gosforth
Peter_Gabriel  dbpedia-owl:hometown  Godalming


Motivating scenario 2/3

DBpedia 2014 ontology has 1310 object and 1725 data properties

Many large Linked Data datasets use relatively lightweight schemas with a high number of object properties


Motivating scenario 3/3

Wiki (mappings: Wikipedia infoboxes -> DBpedia ontology)

Subject (dbpedia-owl:MusicalArtist)  Property              Object (dbpedia-owl:PopulatedPlace)
Norah_Jones                          dbpedia-prop:origin   Denton,_Texas
Mark_Knopfler                        dbpedia-owl:hometown  Gosforth
Peter_Gabriel                        dbpedia-owl:hometown  Godalming


Substitutive Sets Mining Framework

[Figure: Transaction DB -> Frequent Itemset Mining -> Substitutive Set Generation]


Frequent Itemsets

$I = \{i_1, i_2, \ldots, i_m\}$ - a set of items

$D_T = \{t_1, t_2, \ldots, t_n\}$, where $\forall i \; t_i \subseteq I$ - a database of transactions

$\mathrm{support}(X) = \frac{|\{t \in D_T : X \subseteq t\}|}{|D_T|}$

ID  Items
1   Nachos, Pepsi, Salsa
2   Nachos, Coca-Cola, Salsa
3   Nachos, Coca-Cola
4   Nachos, Pepsi, Salsa
5   Milk, Bread

Frequent Itemset      Support
{Nachos}              80%
{Salsa}               60%
{Coca-Cola}           40%
{Pepsi}               40%
{Nachos, Salsa}       60%
{Nachos, Coca-Cola}   40%
{Nachos, Pepsi}       40%
{Salsa, Pepsi}        40%
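The support values on this slide can be checked with a short Python sketch. It uses brute-force enumeration rather than FP-Growth, and the names `support` and `frequent_itemsets`, the 40% threshold, and the size limit of two items are assumptions chosen to mirror the table, not parameters stated on the slide.

```python
from itertools import combinations

# Toy transaction database from the slide (IDs 1-5).
TRANSACTIONS = [
    {"Nachos", "Pepsi", "Salsa"},
    {"Nachos", "Coca-Cola", "Salsa"},
    {"Nachos", "Coca-Cola"},
    {"Nachos", "Pepsi", "Salsa"},
    {"Milk", "Bread"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def frequent_itemsets(transactions, min_support, max_size):
    """Brute-force enumeration (not FP-Growth): test every candidate itemset."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            s = support(combo, transactions)
            if s >= min_support:
                result[frozenset(combo)] = s
    return result


# Assumed parameters: a 40% threshold and itemsets of at most two items,
# which is the range of itemsets the slide's table lists.
L = frequent_itemsets(TRANSACTIONS, min_support=0.4, max_size=2)
for itemset, s in L.items():
    print(sorted(itemset), f"{s:.0%}")
```

Brute force is exponential in the number of items, so it is only suitable for a toy database like this one.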


Covering Set

$CS(i \mid L) = \{X \in L : \{i\} \cup X \in L\}$, where $L$ is the collection of frequent itemsets

$\mathrm{coverage}(i \mid L) = |CS(i \mid L)|$

Frequent Itemset
{Nachos}
{Salsa}
{Coca-Cola}
{Pepsi}
{Nachos, Salsa}
{Nachos, Coca-Cola}
{Nachos, Pepsi}
{Salsa, Pepsi}

i            CS(i)                              coverage
{Nachos}     {{Salsa}, {Coca-Cola}, {Pepsi}}    3
{Salsa}      {{Nachos}, {Pepsi}}                2
{Coca-Cola}  {{Nachos}}                         1
{Pepsi}      {{Nachos}, {Salsa}}                2
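As a minimal sketch, the covering set can be computed directly from its definition. The function names are hypothetical, and the sketch assumes that an itemset X already containing i is excluded from CS(i | L), which is how the example table appears to read the definition.

```python
# Frequent itemsets L copied from the slide.
L = {
    frozenset(s)
    for s in [
        {"Nachos"}, {"Salsa"}, {"Coca-Cola"}, {"Pepsi"},
        {"Nachos", "Salsa"}, {"Nachos", "Coca-Cola"},
        {"Nachos", "Pepsi"}, {"Salsa", "Pepsi"},
    ]
}


def covering_set(item, frequent):
    """CS(item | L): itemsets X in L such that {item} | X is also in L.

    Assumption: X must not already contain `item`.
    """
    return {X for X in frequent if item not in X and (X | {item}) in frequent}


def coverage(item, frequent):
    """coverage(item | L) = |CS(item | L)|."""
    return len(covering_set(item, frequent))


for item in ["Nachos", "Coca-Cola", "Pepsi"]:
    members = sorted(sorted(X) for X in covering_set(item, L))
    print(item, members, coverage(item, L))
```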


Substitutive Sets

A two-element itemset $\{x, y\}$ is a substitutive itemset if:

$x \in L_1$,

$y \in L_1$,

$\mathrm{support}(\{x\} \cup \{y\}) < \varepsilon$, where $\varepsilon$ is a user-defined threshold representing the highest amount of noise allowed in the data,

$\frac{|CS(x \mid L) \cap CS(y \mid L)|}{\max\{|CS(x \mid L)|, |CS(y \mid L)|\}} \geq \mathit{mincommon}$.

i            CS(i)                              coverage
{Nachos}     {{Salsa}, {Coca-Cola}, {Pepsi}}    3
{Salsa}      {{Nachos}, {Pepsi}}                2
{Coca-Cola}  {{Nachos}}                         1
{Pepsi}      {{Nachos}, {Salsa}}                2

$\frac{|CS(\mathrm{Pepsi}) \cap CS(\text{Coca-Cola})|}{\max\{|CS(\mathrm{Pepsi})|, |CS(\text{Coca-Cola})|\}} = \frac{1}{2} = 0.5$
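The four conditions can be combined into one predicate, sketched below in Python for the toy data. The function names and the threshold values passed in the example calls are illustrative assumptions. The sketch also reads L1 as the set of frequent single items and excludes itemsets that already contain the item from its covering set.

```python
TRANSACTIONS = [
    {"Nachos", "Pepsi", "Salsa"},
    {"Nachos", "Coca-Cola", "Salsa"},
    {"Nachos", "Coca-Cola"},
    {"Nachos", "Pepsi", "Salsa"},
    {"Milk", "Bread"},
]

# Frequent itemsets L copied from the slide.
L = {
    frozenset(s)
    for s in [
        {"Nachos"}, {"Salsa"}, {"Coca-Cola"}, {"Pepsi"},
        {"Nachos", "Salsa"}, {"Nachos", "Coca-Cola"},
        {"Nachos", "Pepsi"}, {"Salsa", "Pepsi"},
    ]
}


def support(itemset, transactions):
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def covering_set(item, frequent):
    # Assumption: itemsets already containing `item` are excluded.
    return {X for X in frequent if item not in X and (X | {item}) in frequent}


def common(x, y, frequent):
    """|CS(x) & CS(y)| / max(|CS(x)|, |CS(y)|); 0.0 if both covering sets are empty."""
    cs_x, cs_y = covering_set(x, frequent), covering_set(y, frequent)
    denom = max(len(cs_x), len(cs_y))
    return len(cs_x & cs_y) / denom if denom else 0.0


def is_substitutive(x, y, frequent, transactions, epsilon, mincommon):
    # Assumption: L1 is read as the frequent 1-itemsets contained in L.
    return (
        frozenset({x}) in frequent
        and frozenset({y}) in frequent
        and support({x, y}, transactions) < epsilon
        and common(x, y, frequent) >= mincommon
    )


print(common("Pepsi", "Coca-Cola", L))
# Illustrative thresholds only; they are not taken from the slides.
print(is_substitutive("Pepsi", "Coca-Cola", L, TRANSACTIONS, epsilon=0.1, mincommon=0.5))
print(is_substitutive("Pepsi", "Coca-Cola", L, TRANSACTIONS, epsilon=0.1, mincommon=0.7))
```

Pepsi and Coca-Cola never occur in the same transaction, so the noise condition holds for any positive epsilon; whether the pair is reported then depends only on how mincommon compares with the 0.5 ratio.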


Create Substitutive Sets RapidMiner operator


Use Case: DBpedia

DBpedia knowledge base version 2014

Sets of 3-item transactions {c1, p, c2}, where c1 and c2 are the classes of the subject s and the object o of an RDF triple, and p is the property connecting s and o

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>

SELECT ?c1 ?p ?c2
WHERE {
  ?s rdf:type dbpedia-owl:Organisation .
  ?s ?p ?o .
  ?s rdf:type ?c1 .
  ?o rdf:type ?c2 .
  # Exclude rdf:type and the wiki-page bookkeeping properties
  FILTER(?p != dbpedia-owl:wikiPageWikiLink) .
  FILTER(?p != rdf:type) .
  FILTER(?p != dbpedia-owl:wikiPageExternalLink) .
  FILTER(?p != dbpedia-owl:wikiPageID) .
  FILTER(?p != dbpedia-owl:wikiPageInterLanguageLink) .
  FILTER(?p != dbpedia-owl:wikiPageLength) .
  FILTER(?p != dbpedia-owl:wikiPageOutDegree) .
  FILTER(?p != dbpedia-owl:wikiPageRedirects) .
  FILTER(?p != dbpedia-owl:wikiPageRevisionID)
}


Transaction generation

[Figure: RDF triples annotated with the roles used to build transactions: subject s, property p, object o, class of the subject c1, class of the object c2]

s: Norah_Jones, p: dbpedia-prop:origin, o: Denton,_Texas
s: Mark_Knopfler, p: dbpedia-owl:hometown, o: Gosforth
c1: dbpedia-owl:MusicalArtist, c2: dbpedia-owl:PopulatedPlace

Transactions

{c1 dbpedia-owl:MusicalArtist, dbpedia-owl:hometown, c2 dbpedia-owl:PopulatedPlace}
{c1 dbpedia-owl:MusicalArtist, dbpedia-prop:origin, c2 dbpedia-owl:PopulatedPlace}
...
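A minimal Python sketch of this step, assuming the query results arrive as (c1, p, c2) tuples. The helper name `to_transaction` and the hard-coded sample rows are hypothetical; only the item format with the "c1 " and "c2 " prefixes is taken from the transactions shown above.

```python
def to_transaction(c1, p, c2):
    """Build one 3-item transaction from a (c1, p, c2) result row.

    The role prefixes keep the subject class and the object class apart,
    so a row with c1 == c2 still yields three distinct items.
    """
    return frozenset({f"c1 {c1}", p, f"c2 {c2}"})


# Hypothetical sample rows mirroring the two transactions on the slide.
rows = [
    ("dbpedia-owl:MusicalArtist", "dbpedia-owl:hometown", "dbpedia-owl:PopulatedPlace"),
    ("dbpedia-owl:MusicalArtist", "dbpedia-prop:origin", "dbpedia-owl:PopulatedPlace"),
]

transactions = [to_transaction(*row) for row in rows]
for t in transactions:
    print(sorted(t))
```

Because both transactions share the same c1 and c2 items while hometown and origin never co-occur, the two properties end up with overlapping covering sets, which is what the substitutive set test looks for.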


Experimental Setup

FP-Growth: min number of itemsets = 500, max number of retries = 15, min support = 1.0E-4

Create Substitutive Sets: min support = 1.0E-4, min common = 0.7, epsilon = 1.0E-5

A sample of 100k results per query

Desktop computer with 12 GB RAM and an Intel(R) Core(TM) i5-4570 CPU at 3.20 GHz

A single run of mining substitutive sets (for a single class and 100k transactions) took several seconds on average (ranging from 2 s to 12 s)


Sample substitutive properties for the class Organisation

Item X                      Item Y                       Common
dbpprop:parentOrganization  dbo:parentOrganisation       1.000
dbpprop:owner               dbo:owner                    1.000
dbpprop:origin              dbo:hometown                 1.000
dbpprop:headquarters        dbpprop:parentOrganization   1.000
dbpprop:formerAffiliations  dbo:formerBroadcastNetwork   1.000
dbo:product                 dbpprop:products             1.000
dbpprop:keyPeople           dbo:keyPerson                0.910
dbpprop:commandStructure    dbpprop:branch               0.857
dbo:schoolPatron            dbo:foundedBy                0.835
dbpprop:notableCommanders   dbo:notableCommander         0.824
dbo:recordLabel             dbpprop:label                0.803
dbo:headquarter             dbo:locationCountry          0.803
dbpprop:country             dbo:state                    0.753


Summary

Introduced a model for substitutive itemset mining

Preliminary tests of this model on the task of deduplicating object properties in an RDF dataset (DBpedia)


Acknowledgements

Foundation for Polish Science under the POMOST programme, cofinanced from the European Union, Regional Development Fund (No POMOST/2013-7/8) (2013-2015)

EU FP7 ICT-2007.4.4 (No 231519) "e-LICO: An e-Laboratory for Interdisciplinary Collaborative Research in Data Mining and Data-Intensive Science" (2009-2012)

Thanks to Ewa Kowalczuk for debugging the RapidMiner plugin
