evaluating class assignment semantic redundancy on ...ceur-ws.org/vol-1743/paper12.pdfness (zaveri...

Evaluating Class Assignment Semantic Redundancy on Linked Datasets

Leandro MendozaCONICET, Argentina

LIFIA, Facultad de Informática,UNLP, Argentina

Alicia Dı́azLIFIA, Facultad de Informática,

UNLP, Argentina

Abstract

In this work we address the concept of se-mantic redundancy in linked datasets con-sidering class assignment assertions. Wediscuss how redundancy can be evaluatedas well as the relationship between redun-dancy and three class hierarchy aspects:the number of instances a class has, num-ber of class descendants and class depth.Finally, we performed an evaluation on theDBpedia dataset using SPARQL queriesfor data redundancy checks.

1 Introduction

The amount of interlinked knowledge bases builtunder Semantic Web technologies and followingthe linked data (Heath and Bizer, 2011) princi-ples has increased significantly last years. Theseknowledge bases (also known as linked datasets)contain information that associates Web entities(called resources) with well-defined semanticsthat specifies how these entities should be in-terpreted. In most linked datasets a substantialamount of data corresponds to class assignmentassertions, that is, information that specifies re-sources (or individuals) as instances of certainclasses. In this sense, resources are typified us-ing classes usually defined through ontologies andorganized into class hierarchies. Several differ-ent ontologies can be combined to classify re-sources within the same dataset giving rise to hugeand complex interlinked structures that can sufferfrom data quality problems (Hogan et al., 2010).Thus, the use of practical mechanisms to handleknowledge conciseness becomes increasingly im-portant to improve the overall dataset quality andthe study of redundancy on class assignments as-sertions aims to contribute in this way. From adata quality perspective, class assignment redun-

dancy is related with the concept of extensionalconciseness which has been defined in (Zaveri etal., 2015) as “the case when the data does notcontain redundant objects at instance level”. Inthis scenario, redundancy means that a resource isspecified as member of a class when it is not nec-essary, either because the information is explicitlyduplicated or because it can be derived from infor-mation that already exists. Current works that havedealt with semantic redundancy on linked datasetsimplement algorithms based on graph pattern dis-covering techniques. In contrast, our work pro-poses a simplified approach based on SPARQLqueries and considering class assignment asser-tions. We discuss how redundancy can be eval-uated and perform an evaluation over the DBpe-dia dataset (Lehmann et al., 2015) in order to un-derstand the relationship between redundancy andthree class hierarchy aspects: the number of in-stances a class has, the class depth and its numberof descendants. This approach may be useful forlinked data users who need to measure semanticdata redundancy in a practical way, understand itsorigin and detect when it may be useful (e.g. toimprove performance) or when it can affect nega-tively the knowledge base (e.g. misuse of classeswhen typifying resources). The following sectionsare organized as follows: sections 2 and 3 givesome background definitions and related work, re-spectively. Section 4 introduces the redundancydefinition adopted and discusses some of the alter-natives to address it on linked datasets. Section 5shows the evaluation results. Finally, some con-clusions and further work are given in section 6.

2 Background

In the linked data context, datasets are knowl-edge bases described using the RDF1 data model

1https://www.w3.org/RDF/

103

and published following the linked data princi-ples2. These datasets are collection of assertionsabout resources specified following the “subjectpredicate object” pattern. Assertions are RDFtriples and resources may be anything identifi-able by an HTTP URI. Knowledge representationmechanisms like RDFS3 and OWL4 extend RDFand allow datasets to be augmented with more ex-pressive semantics. For example, it is possibleto describe ontologies by specifying classes andrelationships between them (e.g. “SoccerPlayerrdfs:subClassOf Atlhete”) and to specify re-sources as member of those classes (e.g. “Li-onel Messi rdf:type SoccerPlayer”). From anoverall perspective, information contained in thesedatasets can be split into two levels: schema leveland instance level. Schema level refers to ter-minological knowledge (known as TBox), for ex-ample, classes, properties and their relationships.On the other hand, instance level refers to asser-tional knowledge (known as ABox), that is, propo-sitions about entities of a specific domain of in-terest. An important type of assertional knowl-edge corresponds to class assignments, that is,RDF statements of the form “resource rdf:typeclass” used to specify resources as members ofcertain classes. The most common way to retrievethis information from linked datasets is through aSPARQL5 endpoint. These endpoints are web ser-vices that accept SPARQL queries and return in-formation that match with a given pattern. In thiswork we will use this mechanism to detect redun-dant class assignments.

3 Related Work

In the linked data literature, redundancy is re-lated with the data quality dimension of concise-ness (Zaveri et al., 2015) and has been studiedand categorized from syntactic to semantic andfrom schema to instance levels (Pan et al., 2014).From a syntactic perspective most of the existingcompression techniques focus on RDF serializa-tion. On the other hand, from a semantic perspec-tive, just a few works addressed redundancy. In(Wu et al., 2014) authors propose a graph based

2https://www.w3.org/DesignIssues/LinkedData.html

3https://www.w3.org/TR/rdf-schema/4https://www.w3.org/standards/techs/

owl\#w3c\_all5https://www.w3.org/TR/

rdf-sparql-query/

analysis method to identify graph patterns that canbe used to remove redundant triples and calculatethe volume of semantic redundancy. In (Joshi etal., 2013) authors employ frequent itemset (fre-quent pattern) mining techniques to generate aset of logical rules to compress RDF datasets andthen use these rules during decompression. Bothworks mention the idea of semantic compressionby removing derivable knowledge. Regarding theuse of SPARQL for quality assessment, (Fürberand Hepp, 2010) and (Kontokostas et al., 2014)use query templates to detect some quality prob-lems but semantic redundancy is not included.Inspired on the ideas of these works, we use aSPARQL query oriented approach to evaluate re-dundant class assignments and make this informa-tion explicit to users.

4 Redundant class assignments

As we know, linked datasets are basically setsof RDF triples and knowledge is specified usingmechanisms provided by RDFS and OWL, eachone with its own well-defined semantics (Hitzler etal., 2009). In this way, schema and instance levelassertions can be considered as propositions to for-mally describe the notion of derivable knowledge.For example, the notation {p1, p2} |= {p3, p4}(where |= is called entailment relation) states thatpropositions p3 and p4 (also p1 and p2) are log-ical consequences of propositions p1 and p2 ob-tained under a certain set of rules (logic). Con-sidering this, the concept of data redundancy canbe associated to what is known in mathematicallogic literature as independence, that is, the abil-ity to deduce a proposition from other proposi-tions. Formally, given a logic L (semantics) and aset of propositions P , it is defined as independentif for all proposition p

i

2 P does not hold that{P � p

i

} |= pi

. In this way, a non-independentset of propositions can be considered redundantsince it contains extra information that may not benecessary because if it is removed from the initialset, it can be obtained from the remaining proposi-tions applying an inference mechanism and keep-ing the same logical consequences. Similarly, anindependent set of propositions can be considerednon-redundant.

As we mentioned in section 2, class assign-ments are instance level assertions (or proposi-tions) that specify resources as members of cer-tain classes. Thus, given a resource r, its class as-

104

signment set (CASr

) contains all the propositionsthat specify the classes to which r belongs. Theidea of the previous paragraph can be applied toclass assignments to define semantic redundancysince the non-redundant class assignment set ofa resource r (NRCAS

r

) is the independent set ofCAS

r

. Then, the redundant class assignment set(RCAS

r

) can be considered as the difference ofthose sets. In the following subsections, we dis-cuss some techniques that can be used to computeNRCAS

r

on linked datasets. Then, we will useone of these techniques to perform our evaluation.

4.1 Using SPARQL queriesUsing SPARQL, a simple query can be imple-mented to get the NRCAS. For example, queryin listing 1 can be used to get the non-redundantset of classes of a given resource (specified by re-source URI).

SELECT DISTINCT ?cWHERE { rdf:type ?cFILTER regex(str(?c),"ont_URI","i")FILTER NOT EXISTS { rdf:type ?sc .FILTER regex(str(?sc),"ont_URI","i")?sc rdfs:subClassOf ?c }

}

Listing 1: SPARQL query example to get non-redundant class assignments

Note that the mentioned query example consid-ers only one ontology (filtered by ontology URI)and does not implement any inference mechanismat instance or schema level. This means that thequery will work while all class assignments andrelationships between the involved classes will bespecified explicitly on the dataset. If this is not thecase and a transitive closure of sub/super classesis needed, it is necessary to implement an algo-rithm that iterates recursively over these queriesuntil it gets the required classes. Using SPARQLproperty paths (e.g. rdfs:subClassOf* orrdfs:subClassOf+) it is possible to checkconnectivity of two classes by an arbitrary lengthpath (route through a graph between two graphnodes)6. For example, it can be used to get allthe classes that are descendant (or subclasses) ofa given class (as shows example of listing 2) or to

6https://www.w3.org/TR/sparql11-property-paths/

get all the ancestors (or depth) of a given class (asshows example of listing 3).

SELECT DISTINCT ?cWHERE {?c rdf:subClassOf*

}

Listing 2: SPARQL query to get class descendants

SELECT DISTINCT ?cWHERE { rdf:subClassOf* ?c

}

Listing 3: SPARQL query to get class ancestors

It is important to highlight that the performanceof SPARQL queries depends on its implemen-tation and the dataset size. Although complexSPARQL queries can become unacceptably slowwhen working with large amounts of data, it iscurrently the most practical mechanism to accesslinked datasets.

4.2 Using graph based algorithm andreasoners

Given a class hierarchy, a resource r and its CASr

,a way to compute redundancy is by interpretingthe class hierarchy as a directed acyclic graph “G”in which each node is a class and each edge isthe relation rdfs:subClassOf. A node “A”of “G” can be considered a class if there exist atriple with the form “A rdfs:subClassOf x”,“x rdfs:subClassOf A” or “x rdf:typeA”. Then, a class “B” is subclass of a class “A”if node “A” is reachable from node “B” in “G”,that is, if exists a path between B and A in thegraph. Considering this, given a proposition setQ that specifies the classes to which an instance ibelongs, a naive algorithm can be implemented tocompute a non-redundant proposition set R: firstset R=Q, then for each element q in Q check ifthere is a path from some of the remaining propo-sitions in Q to q, if so, q is deleted from R. Finally,the algorithm returns R which is then the non re-dundant class assignments set of r. Removingredundancies can be associated with the problemknown as transitive reduction (Aho et al., 1972)which has an unfortunate complexity class if it is

105

implemented naively. If the given graph is a fi-nite directed acyclic graph current approaches thatsolve this problem are close to the upper boundO(n2.3729) but if the graph has cycles the problembelongs to NP-hard class.

Another alternative to compute redundancy isby using Semantic Web reasoners that have beenimplemented based on decidable fragments ofRDFS and OWL semantics. These tools imple-ment inference mechanisms that can be used todeduce if a resource belongs to a given class (in-stance checking). The main advantage of usingreasoners is that the potential of the underling se-mantics can be exploited (e.g. several ontologiescan be combined to get implicit knowledge). Onthe other hand, the disadvantage of using thesetools in linked data scenario is its complexity: in-ference techniques work well for small exampleswith limited knowledge but they turn unacceptablyslow for large-scale datasets. Besides, when multi-ples ontologies are combined, inconsistencies canarise affecting the inference process and hamper-ing the detection of redundant propositions.

5 Evaluation

To perform our evaluation we selected the En-glish version of DBpedia7 and set up a local mir-ror using a Virtuoso8 server (version 7.2) . Themechanisms implemented to compute redundantclass assignment avoid the use of complex graphbased algorithms or RDFS/OWL reasoners anduse a SPARQL query oriented approach (see sec-tion 4.1). Although resources in DBpedia are clas-sified using several classes of different schema weonly considered the DBpedia9 core and YAGO10

ontologies because information about the involvedclass hierarchies (subclass relationships) can beobtained directly from queries through the datasetSPARQL endpoint. DBpedia ontology is a shal-low cross-domain ontology that covers more than600 classes and was created based on Wikipediainfoboxes. YAGO is a taxonomy used in theYAGO knowledge base that currently covers morethan 350,000 classes. The evaluation is organizedin the next subsections as follows: we first per-formed an overall redundancy evaluation consid-ering DBpedia and YAGO ontologies and then a

7http://wiki.dbpedia.org/Downloads2015-048http://virtuoso.openlinksw.com/9http://wiki.dbpedia.org/services-resources/ontology

10https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/

further analysis was done per class groups butonly considering the DBpedia class hierarchy inorder to keep the number of classes manageable.For each class group, we analyzed the relationshipbetween redundant class assignments (RCA) andthree class hierarchy characteristics: class depth,class descendants and number of class assign-ments per class.

5.1 Overall redundancy evaluationThe first overall evaluation was made by retriev-ing all resources that belong to some class ofthe DBpedia ontology (6,729,604 resources of453 classes) and then we did the same with theYAGO ontology (2,886,306 resources of 369,144classes). For each resource we compute its CAS,its NRCAS and its RCAS (see section 4) con-sidering both ontologies separately. Informationabout resources and its CAS and NRCAS wereobtained through SPARQL queries (see section4.1) and RCAS was obtained by computing thedifference CAS � NRCAS. Results can beviewed in table of figure 1. Each element ofeach CAS was counted as a different class assign-ment (nbCA column), each element of NRCAwas counted as a non-redundant class assignment(nbNRCA column) and each element of RCA wascounted as a redundant class assignment (nbRCAcolumn). As we can see in chart of figure 1,considering classes of the DBpedia ontology al-most half class assignments are redundant. On theother hand, considering the YAGO ontology 80%of class assignments are redundant. In the lattercase, the amount of class assignments is higherand the amount of concepts in the class hierar-chy increases considerably. These results sug-gest a relationship between the number of classes,class assignments and redundancy: as the numberof classes and class assignments increases, so theprobability of redundancy.

5.2 Redundancy and class depthTo analyze the relationship between redundantclass assignments and class depth we categorizedclasses into groups from 0 to 6 according to theirdepth in the DBpedia class hierarchy (the distancefrom the root to that class) and then we counthow many class assignments refer to those classes.Classes with depth 0 are the most general and 6is the max depth found in the class hierarchy. Aclass assignment refers to (or belongs to) a class Cif it is a triple of the form (resource rdf:type

106

Figure 1: DBpedia and YAGO redundancy evalu-ations

C). To compute the depth of a class we used aSPARQL query to count the number of ancestors(see section 4.1 listing 3). Results can be viewedin table 1, chart of figure 2 shows the relationshipbetween the class depth and the percentage of re-dundant class assignments (%RCA) and chart offigure 3 shows how these redundant class assign-ments are distributed.

Depth nbClasses nbCA nbRCA0 32 5,037,966 3,940,9201 86 3,941,230 2,877,4342 110 5,385,831 1,156,3273 180 1,154,660 118,8864 39 118,891 5,7775 5 5,777 06 1 889 0

Table 1: Redundancy and class depth evaluation.

As we can see on chart of figure 2, as the classdepth increases (more specific a class is), the num-ber of redundant class assignments decreases.

Figure 2: RCA vs class depth

Chart of figure 3 shows that more than 80% ofredundant class assignments refer to more generalclasses (with less depth). For example, in table1 we can see that there are 32 classes (nbClasses

column) of depth 0 (Depth column), 5,037,966class assignments (nbCA column) that refer tothose classes and 3,940,920 of them are redundant(nbRCA column). Examples of classes that belongto that group are Agent, Place, Work, etc.

Figure 3: RCA distribution considering the classdepth

5.3 Redundancy and class descendants

To analyze the relationship between redundantclass assignments and class descendants we cat-egorized classes into 10 groups according to thenumber of descendants that they have. To computethe descendants we used a SPARQL query to getthe subclasses of a given class (see section 4.1 list-ing 2). Results are showed in table 2 and chart offigure 4 shows the relationship between the num-ber of class descendants and the percentage of re-dundant class assignments (%RCA). Chart of fig-ure 5 shows how these redundant class assign-ments are distributed. Classes that do not havedescendants (330 classes) are the most specificand class assignments that belong to that groupare not redundant. As we can see in chart of fig-ure 4, when the number of descendant per classincreases, the number of redundant class assign-ments also increases. For example, group named“1 to 5” refers to classes that have between 1 to5 descendants (78 classes) and redundancy is rela-tively low. On the other hand, classes with severaldescendants (e.g. class Agent) has a high level ofredundancy. As chart of figure 5 shows, only 3classes have more than 100 descendants (Agent,Person and Place) and they concentrates the 65%of redundant class assignments.

107

nbDesc nbClasses nbCA nbRCA0 330 3,250,543 01 to 5 78 3,209,728 321,0786 to 10 16 948,187 524,26210 to 20 13 666,630 531,84821 to 30 8 773,316 751,48831 to 50 2 535,606 502,89251 to 70 3 809,171 784,10371 to 100 1 220,219 211,415101 to 200 2 2,860,585 2,117,098More than 200 1 5,231,844 4,472,258

Table 2: Redundancy and class descendants eval-uation.

Figure 4: RCA vs class descendants

5.4 Redundancy and number of classassignments

To analyze the relationship between redundantclass assignments and the number of class as-signments per class we categorized classes into10 groups according to the number of class as-signments that refers to a class. Table 3 showsthe evaluation results and chart of figure 6 showsthe relationship between redundancy and the num-ber of class assignments (or instances) per class.Columns nbCA-acc and nbRCA-acc show the totalnumber of class assignments and redundant classassignments in each group.

nbCA nbClasses nbCA-acc nbRCA-acc0 to 10K 315 599,261 101,86110K to 20K 35 484,931 120,51220K to 30K 34 588,845 267,46030K to 40K 25 384,601 220,27040K to 30K 11 900,290 36,156100K to 200K 7 906,730 365,085200K to 300K 5 1,274,010 1,222,844300K to 400K 1 396,046 395,804400K to 500K 2 934,843 681,445More than 500K 6 9,292,013 4,472,258

Table 3: Redundancy and number of class assign-ments evaluation.

As we can see, most of classes (315) have

Figure 5: RCA distribution considering class de-scendants

less than 10K class assignments. Besides, asthis number increases the number of classes in-volved decreases but the percentage of redundantclass assignments increases. For example, classesthat have more than 500K class assignments (e.g.Agent, Place, Person, etc.) concentrate most ofthem (9,292,013) and 48% are redundant. We alsoobserve that the second group of most used classes(between 200K and 500K class assignments) con-sists of about 8 classes with high levels of redun-dancy (between 70% and 95%).

Figure 6: RCA vs number of class assignments

6 Conclusions and future work

In this work we addressed the concept of seman-tic redundancy considering class assignments as-sertions in linked datasets. Based on a formaldefinition we discussed how redundant (and non-redundant) class assignments sets can be detected.Inspired in previous related work, we conductedan evaluation over the English version of DBpediabased on SPARQL queries. We analyzed the re-lationship between redundancy and three class hi-

108

erarchy characteristics: the number of instances aclass has, its depth and its number of descendants.

Regarding the evaluation results, they suggestthat there is a relationship between redundancyand depth of a class in a hierarchy: as more gen-eral a class is (less depth), more redundant classassignments can be found that refer to that class.In a similar way, as the number of descendantsper class increases so does the number of redun-dant class assignments related to that class. Par-ticularly, we noted that this also occurs in mostpopulated classes (with more class assignments).In this sense, datasets that use complex and largeclass hierarchies to typify their resources in uncon-trolled environments (such as crowsourced gener-ated content) may be more prone to class assign-ments redundancy. Considering this, we can makethe following observations:

• SPARQL queries as a mechanism to evalu-ate redundancy offer practical ways to imple-ment quality checks and get some statisticsof linked datasets. However, SPARQL in-ference capabilities are limited: we can dis-cover just some graph patterns using queriesbut other implicit knowledge will be unreach-able since we can not exploit all the semanticcapabilities or expressiveness of the used lan-guages.

• Redundancy analysis can be used to detectclass assignments patterns and data publish-ers behaviors. For example, in our evaluationwe detected that when a resource is assignedto a very specific class, it is also assigned ex-plicitly to the ancestors of that class. Discov-ering these kinds of patterns may be useful toimprove the linked data generation process oreven to understand how classes described in agiven ontology are used on a specific dataset.

Future work will be focused on the developmentand assessment of semantic redundancy metricsthat support our results on other linked datasets.Besides, we plan to study how redundancy canaffect other data quality dimensions particularlythose related with semantic accuracy.

ReferencesA. V. Aho, M. R. Garey, and J. D. Ullman. 1972. The

transitive reduction of a directed graph. SIAM Jour-nal on Computing, 1(2):131.

Christian Fürber and Martin Hepp. 2010. Using se-mantic web resources for data quality management.In Proceedings of the 17th International Conferenceon Knowledge Engineering and Management by theMasses, EKAW’10, pages 211–225, Berlin, Heidel-berg. Springer-Verlag.

Tom Heath and Christian Bizer. 2011. Linked Data:Evolving the Web into a Global Data Space, vol-ume 1 of Synthesis Lectures on the Semantic Web:Theory and Technology. Morgan Claypool, 1st ed.,html version edition, February.

Pascal Hitzler, Markus Krötzsch, and SebastianRudolph. 2009. Foundations of Semantic Web Tech-nologies. Chapman & Hall/CRC.

Aidan Hogan, Andreas Harth, Alexandre Passant, Ste-fan Decker, and Axel Polleres. 2010. Weaving thepedantic web. In Linked Data on the Web Workshop(LDOW2010) at WWW’2010, volume 628, pages30–34. CEUR Workshop Proceedings.

Amit Krishna Joshi, Pascal Hitzler, and Guozhu Dong,2013. The Semantic Web: Semantics and BigData: 10th International Conference, ESWC 2013,Montpellier, France, May 26-30, 2013. Proceedings,chapter Logical Linked Data Compression, pages170–184. Springer Berlin Heidelberg, Berlin, Hei-delberg.

Dimitris Kontokostas, Patrick Westphal, Sören Auer,Sebastian Hellmann, Jens Lehmann, Roland Cor-nelissen, and Amrapali Zaveri. 2014. Test-drivenevaluation of linked data quality. In Proceedings ofthe 23rd International Conference on World WideWeb, WWW ’14, pages 747–758, Republic andCanton of Geneva, Switzerland. International WorldWide Web Conferences Steering Committee.

Jens Lehmann, Robert Isele, Max Jakob, AnjaJentzsch, Dimitris Kontokostas, Pablo N. Mendes,Sebastian Hellmann, Mohamed Morsey, Patrick vanKleef, Sören Auer, and Christian Bizer. 2015. DB-pedia - a large-scale, multilingual knowledge baseextracted from wikipedia. Semantic Web Journal,6(2):167–195.

Jeff Z Pan, JM Gómez-Pérez, Yuan Ren, HonghanWu, and Man Zhu. 2014. Ssp: Compressing rdfdata by summarisation, serialisation and predictiveencoding. Technical report, Technical report, 072014. Available as http://www. kdrive-project. eu/re-sources.

Honghan Wu, Boris Villazn-Terrazas, Jeff Z. Pan, andJos Manul Gmez-Prez. 2014. How redundant isit? - an empirical analysis on linked datasets. InOlaf Hartig, Aidan Hogan, and Juan Sequeda, edi-tors, COLD, volume 1264 of CEUR Workshop Pro-ceedings. CEUR-WS.org.

Amrapali Zaveri, Anisa Rula, Andrea Maurino, Ri-cardo Pietrobon, Jens Lehmann, and Sören Auer.2015. Quality assessment for linked data: A survey.Semantic Web Journal.

109

evaluating class assignment semantic redundancy on ...ceur-ws.org/vol-1743/paper12.pdfness (zaveri...

Documents