similarity reasoning for the semantic web based on fuzzy concept lattices: an informal approach

10
Inf Syst Front (2013) 15:511–520 DOI 10.1007/s10796-011-9340-y Similarity reasoning for the semantic web based on fuzzy concept lattices: An informal approach Anna Formica Published online: 29 January 2012 © Springer Science+Business Media, LLC 2012 Abstract Similarity Reasoning in the presence of vague information is becoming fundamental in several research areas and, in particular, in the Semantic Web. Fuzzy Formal Concept Analysis (FFCA) is a general- ization of Formal Concept Analysis (FCA) for mod- eling uncertainty information. Although FFCA has become very interesting for supporting different activi- ties for the development of the Semantic Web, in the literature it is usually addressed at a technical level and intended for a restricted audience. This paper pro- poses a similarity measure for FFCA concepts. The key notions underlying the proposed approach are pre- sented informally, in order to reach a broad audience of readers. Keywords Semantic web · Similarity reasoning · Fuzzy formal concept analysis 1 Introduction Similarity Reasoning, i.e. the identification of syntacti- cally different concepts that are semantically close, is fundamental in several research areas such as Cognitive Science, Artificial Intelligence, Software Engineering and, recently, in the Semantic Web (Berners-Lee et al. 2001). In this context, Similarity Reasoning becomes more important in the presence of vague information, A. Formica ( ) Istituto di Analisi dei Sistemi ed Informatica (IASI) “Antonio Ruberti”, National Research Council, Viale Manzoni 30, 00185, Rome, Italy e-mail: [email protected] when some data are more relevant than others, or when a feature does not fully describe (or is not totally appropriate to) the resource the user is looking for. These types of problems can be tackled with fuzzy information. Fuzzy Formal Concept Analysis (FFCA) is a generalization of Formal Concept Analysis (FCA) (Wille 1982) for modeling uncertainty information (Belohlávek et al. 2008). It is based on the notion of a Fuzzy Concept Lattice and can support different activities in the development of the Semantic Web such as ontology mapping, integration, alignment, etc. (Bai and Liu 2008; Tho et al. 2006). In this article a measure for evaluating the similar- ity of FFCA concepts is presented. It originates from a previous proposal for non-fuzzy Concept Lattices (Formica 2008), that was formalized by the author in 2010 (Formica 2010). With respect to other papers defined in the literature, the key concepts underlying the proposed approach are presented informally in or- der to reach a broad audience of readers. In particular, this proposal allows FFCA concept similarity to be evaluated as a combination of the similarity of concept extents (fuzzy sets) and concept intents. Concept in- tents are compared according to the information con- tent approach (Resnik 1995; Lin 1998), which has been extensively investigated in the literature, and which has a higher correlation with human judgement than traditional approaches. Note that this proposal is in line with a significant amount of literature that aims at using FCA as an interesting framework for supporting the development of the Semantic Web. In this perspective, Concept Lattices are assumed as given and the problem of computing and/or reducing the number of nodes of the lattice goes beyond the scope of this paper.

Upload: anna

Post on 23-Dec-2016

214 views

Category:

Documents


0 download

TRANSCRIPT

Inf Syst Front (2013) 15:511–520DOI 10.1007/s10796-011-9340-y

Similarity reasoning for the semantic web basedon fuzzy concept lattices: An informal approach

Anna Formica

Published online: 29 January 2012© Springer Science+Business Media, LLC 2012

Abstract Similarity Reasoning in the presence ofvague information is becoming fundamental in severalresearch areas and, in particular, in the Semantic Web.Fuzzy Formal Concept Analysis (FFCA) is a general-ization of Formal Concept Analysis (FCA) for mod-eling uncertainty information. Although FFCA hasbecome very interesting for supporting different activi-ties for the development of the Semantic Web, in theliterature it is usually addressed at a technical leveland intended for a restricted audience. This paper pro-poses a similarity measure for FFCA concepts. Thekey notions underlying the proposed approach are pre-sented informally, in order to reach a broad audience ofreaders.

Keywords Semantic web · Similarity reasoning ·Fuzzy formal concept analysis

1 Introduction

Similarity Reasoning, i.e. the identification of syntacti-cally different concepts that are semantically close, isfundamental in several research areas such as CognitiveScience, Artificial Intelligence, Software Engineeringand, recently, in the Semantic Web (Berners-Lee et al.2001). In this context, Similarity Reasoning becomesmore important in the presence of vague information,

A. Formica (�)Istituto di Analisi dei Sistemi ed Informatica (IASI)“Antonio Ruberti”, National Research Council,Viale Manzoni 30, 00185, Rome, Italye-mail: [email protected]

when some data are more relevant than others, orwhen a feature does not fully describe (or is not totallyappropriate to) the resource the user is looking for.These types of problems can be tackled with fuzzyinformation. Fuzzy Formal Concept Analysis (FFCA)is a generalization of Formal Concept Analysis (FCA)(Wille 1982) for modeling uncertainty information(Belohlávek et al. 2008). It is based on the notionof a Fuzzy Concept Lattice and can support differentactivities in the development of the Semantic Web suchas ontology mapping, integration, alignment, etc. (Baiand Liu 2008; Tho et al. 2006).

In this article a measure for evaluating the similar-ity of FFCA concepts is presented. It originates froma previous proposal for non-fuzzy Concept Lattices(Formica 2008), that was formalized by the author in2010 (Formica 2010). With respect to other papersdefined in the literature, the key concepts underlyingthe proposed approach are presented informally in or-der to reach a broad audience of readers. In particular,this proposal allows FFCA concept similarity to beevaluated as a combination of the similarity of conceptextents (fuzzy sets) and concept intents. Concept in-tents are compared according to the information con-tent approach (Resnik 1995; Lin 1998), which has beenextensively investigated in the literature, and whichhas a higher correlation with human judgement thantraditional approaches.

Note that this proposal is in line with a significantamount of literature that aims at using FCA as aninteresting framework for supporting the developmentof the Semantic Web. In this perspective, ConceptLattices are assumed as given and the problem ofcomputing and/or reducing the number of nodes of thelattice goes beyond the scope of this paper.

512 Inf Syst Front (2013) 15:511–520

The paper is organized as follows. In Sections 2 and3, Formal Concept Lattices and Fuzzy Formal ConceptLattices are recalled, respectively. In Section 4, thenotion of information content similarity is given and,in Section 5, the similarity between FFCA concepts isintroduced. The Related Work follows in Section 6,and in Section 7 the method is evaluated, discussed andcompared to the existing literature. Finally Section 8concludes.

2 Formal concept lattices

In FCA a context is a triple (O,A,R), where O is a setof objects, A is a set of attributes, and R is a binaryrelation between O and A. For instance, consider acontext named Sardinia Hotels, suppose that the set Ois defined by the following six objects representing sixdifferent hotels:

O = {H1, H2, H3, H4, H5, H6},

and the set A is defined by three possible attributes ofthese objects:

A = {SwPool, Sea, Meal}

where SwPool stands for swimming pool. Furthermore,suppose the hotels are related to the above attributesaccording to the binary relation R defined by Table 1.

Table 1 establishes that, for instance, the hotel H2has two attributes, namely SwPool and Meal and,viceversa, both attributes SwPool and Meal apply tothe object H2. Given a context, in FCA a concept isa pair (E,I), where the former element, referred toas concept extent, is a set consisting of precisely thoseobjects having all the attributes from the latter and,conversely, the latter, referred to as concept intent, is aset containing precisely those attributes that apply to allthe objects from the former. A concept of the SardiniaHotels context is:

((H1, H3, H5), (Sea, Meal))

Table 1 The Sardinia Hotels context in (non-fuzzy) FCA

SwPool Sea Meal

H1 x xH2 x xH3 x xH4 x xH5 x xH6 x x

because all the H1,H3,H5 objects have both the at-tributes Sea and Meal, and viceversa, both these at-tributes only apply to the objects H1,H3,H5.

Given two concepts of a context, (E1,I1), (E2,I2), it ispossible to establish an inheritance relation (≤) betweenthem as follows:

(E1, I1) ≤ (E2, I2) iff E1 ⊆ E2(iff I2 ⊆ I1).

In particular, (E1,I1) is called subconcept of (E2,I2) and(E2,I2) is called superconcept of (E1,I1).

Given a context (O,A,R), consider the set of all theconcepts of this context, indicated as L(O,A,R). Then:

(L(O, A, R), ≤)

is a complete lattice called Formal Concept Lattice(Concept Lattice for short), i.e., for each subset ofconcepts, the greatest common subconcept and theleast common superconcept exist (Wille 1982). For in-stance, the Concept Lattice that is constructed fromthe context of Table 1 is shown in Fig. 1. Note thatnodes are labeled with the concepts of the context,and arcs are established among the nodes whose as-sociated concepts are in ≤ relation. The ConceptLattice has also two special nodes, the maximumand minimum nodes, grouping all the objects andthe attributes of the context, respectively. In Fig. 1,the least common superconcept of, for instance, theconcepts ((H4,H6), (SwPool,Sea)) and ((H1,H3,H5),(Sea,Meal)) is ((H1,H3,H4,H5,H6), (Sea)), whereasthe greatest common subconcept of ((H2,H4,H6),(SwPool)) and ((H1,H2,H3,H5), (Meal)) is the con-cept ((H2), (SwPool,Meal)).

Unfortunately, modeling a domain of interest withtraditional FCA (i.e., with non-fuzzy sets) can be in-accurate when the attributes do not describe the ob-jects in a uniform way or, in other words, a given

( (H2,H4,H6), (SwPool) ) ( (H1,H2,H3,H5), (Meal) )

( (H2), (SwPool,Meal) )

( (H1,H2,H3,H4,H5,H6), () )

( (H1,H3,H4,H5,H6), (Sea) )

( (H4,H6), (SwPool,Sea) )( (H1,H3,H5), (Sea,Meal) )

( ( ), (SwPool,Sea,Meal) )

Fig. 1 Concept Lattice of the Sardinia hotels context

Inf Syst Front (2013) 15:511–520 513

attribute applies to different objects in different ways.For instance, in our example, consider the attributeSea. One should be able to distinguish the hotels justlocated on the sea, from that having a walking distanceseaside (reachable in, let us say, ten or twenty minutes).Analogously, regarding the attribute Meal, we wouldlike to be aware about the hotels providing both lunchand dinner, rather than half-board. Without the intro-duction of fuzzy information, we have no way to specifyhow appropriate is a feature, or an attribute, to a givenobject, therefore describing all the objects in a uni-form way.

3 Fuzzy formal concept lattices

A fuzzy set A in a space of points X is characterizedby a membership function μA(x) which associates witheach point x in X a real number in the interval [0,1]representing the grade of membership of x in A (Zadeh1965). Note that for an ordinary set, the membershipfunction can take only the values 1 and 0, dependingon x does or does not belong to A, respectively. Justto provide an example, assume X is a set of people,a fuzzy set Young is defined by associating with eachperson in X a real number in [0,1] establishing thedegree of youth of a person, such that the nearer thisvalue to unity, the higher the grade of membership of aperson in the set Young. The notion of a fuzzy relationis obtained by generalizing the notion of a fuzzy set, i.e.,a fuzzy relation R in X × Y is a fuzzy set in the productspace X × Y.

For instance, consider the Sardinia Hotels contextwith fuzzy information. This context, referred to asfuzzy context, is specified by the fuzzy relation givenin Table 2. In particular, crosses in Table 1 have beenreplaced by grades of membership, from 0 to 1, eachallowing us to quantify “how much” an object has, oris described by, an attribute and viceversa an attributeapplies to an object.

For instance, consider the hotel H2 in Table 2. Ithas the attribute SwPool with grade of membership1.0, which means that such attribute fully applies to the

Table 2 The Sardinia Hotels context in FFCA

SwPool Sea Meal

H1 1.0 1.0H2 1.0 0.5H3 0.7 0.5H4 1.0 1.0H5 0.3 1.0H6 1.0 0.8

hotel H2 (and viceversa the hotel H2 can be properlydescribed by the attribute SwPool). Instead, the objectH2 has the attribute Meal with a membership value0.5, which means that such an attribute partially appliesto this hotel (for instance it could provide meals justfor lunch). Analogously, Sea is an attribute that betterdescribes the hotel H1 than H3 (the related grades ofmembership are 1.0 and 0.7 respectively), whereas it ismore appropriate to H3 than H5 (having H5 grade ofmembership equal to 0.3). However, in order to addressonly objects related to attributes with relevant gradesof membership, a threshold is fixed such that the pairswith membership values less than the threshold areignored. For instance, consider our running exampleand assume that a threshold is fixed equal to 0.5. Afuzzy concept of the Sardinia Hotels fuzzy context is,for instance, the pair:

(((H1, 1.0), (H3, 0.5)), (Sea, Meal))

In fact the objects H1,H3 share both the attributes Seaand Meal and, viceversa, both attributes Sea and Mealonly apply to the objects H1 and H3, with membershipvalues which are not less than the threshold. Withrespect to the previous example without fuzziness, theobject H5 is not present due to the threshold. It isimportant to note that in a concept, in the case theattributes apply to an object with different grades ofmembership, the minimum among them is selected. Forinstance, the object H3 has the attributes Sea and Mealwith different grades of membership, that are 0.7 and0.5, respectively. In the concept above, the minimumvalue between them has been selected because it rep-resents the highest common grade of membership thatallows H3 to be described by both the attributes Seaand Meal.

Fuzzy Formal Concept Lattices (Fuzzy ConceptLattices for short) is defined similar to Concept

( ((H1,1.0),(H2,1.0),(H3,1.0),(H4,1.0),(H5,1.0),(H6,1.0)), ( ) )

( ((H2,1.0),(H4,1.0),(H6,1.0)), (SwPool) ) ( ((H1,1.0),(H2,0.5),(H3,0.5),(H5,1.0)),(Meal) )

( ((H2,0.5)), (SwPool,Meal) )

( ( ), (SwPool,Sea,Meal) )

( ((H4,1.0),(H6,0.8)), (SwPool,Sea) )( ((H1,1.0),(H3,0.5)), (Sea,Meal) )

( ((H1,1.0),(H3,0.7),(H4,1.0),(H6,0.8)), (Sea) )

Fig. 2 Fuzzy Concept Lattice of the Sardinia hotels context

514 Inf Syst Front (2013) 15:511–520

Lattices. The Fuzzy Concept Lattice that is constructedfrom the context of Table 2 is shown in Fig. 2.

4 Information content similarity

In the following, the notion of information content sim-ilarity will be recalled (Formica 2008), that allows sim-ilarity of concept intents (attributes) to be computed.It is based on the well-known information content ap-proach, which has been extensively investigated in theliterature (Lin 1998; Resnik 1995).

Let us consider a lexical database for the English lan-guage as, for instance, WordNet (WordNet 2010). Be-sides English concept nouns, WordNet contains verbs,adjectives and adverbs, each associated with the relatednatural language definition and frequency. Frequenciesare estimated using noun frequencies from large textcorpora, as for instance the Brown Corpus of AmericanEnglish. Concept nouns are organized according to theISA and PartOf relationships, and for each conceptnoun, a set of synonyms is given (SynSet). The proba-bility of a concept noun c, p(c), is defined as:

p(c) = f req(c)M

where f req(c) is the frequency of c from a text corpus,and M is the total number of observed instances ofnouns in the corpus. Similar to Formica (2008), in thispaper probabilities have been assigned according to theSemCor project, which labels subsections of the BrownCorpus to senses in the WordNet lexicon (Fellbaum1998).

In the following, a weighted ISA hierarchy is an ISAhierarchy where each concept noun c is associated withthe probability p(c).

For instance, below the definitions of Water, Lake,Stream, Beach, and Sea are given according toWordNet, and their frequencies (the number in paren-thesis):

(219) Water—the part of the earth’s surface coveredwith water (such as a river or lake or ocean);

(3) Lake—a body of (usually fresh) water sur-rounded by land;

(20) Stream—a natural body of running waterf lowing on or under the earth;

(14) Beach—an area of sand sloping down to thewater of a sea or lake;

(38) Sea—a division of an ocean or a large body ofsalt water partially enclosed by land;

....

A fragment of a weighted ISA hierarchy derivedfrom WordNet is shown in Fig. 3. A small set of sets ofsynonyms of WordNet that is relevant to our exampleis given below:

SynSet = {..., {Stream, Watercourse},{Water, Body_of _water}, ...}

Once probabilities have been associated with nouns,the starting assumption of the approach is that theinformation content of a concept noun c is defined as− logp(c), that is, as the probability of a concept nounincreases, the informativeness decreases, therefore themore abstract a concept noun, the lower its informationcontent. According to this approach, the similarity ofhierarchically organized concept nouns is given by themaximum information content shared by the concepts,that is, the more information two concepts share, themore similar they are (Resnik 1995). Note that giventwo concept nouns, say c1, c2, the maximum informa-tion content shared by c1, c2 in the taxonomy is pro-vided by the superconcept of c1, c2 whose informationcontent is maximum (i.e., when defined, the least com-mon superconcept). Starting from these assumptions,concept noun similarity is defined by the maximuminformation content shared by the concepts divided bythe information contents of the comparing concepts(information content similarity—ics) (Lin 1998).

In our running example consider Lake and Sea.Their least common superconcept exists in the hierar-chy and is represented by Water. Therefore:

ics(Lake, Sea) = 2 log p(Water)log p(Lake) + log p(Sea)

= 2 ∗ 8.66

14.85 + 11.18= 0.67.

In the following, since concept intents are defined bysets of attributes, we will refer to attributes rather thanconcept nouns.

Top (1)

.....

...

Water (0.00248)

Lake (0.00003) Stream (0.00023)

Beach (0.00016)

Sea (0.00043) .....

Fig. 3 A fragment of a weighted ISA hierarchy derived fromWordNet

Inf Syst Front (2013) 15:511–520 515

5 Similarity between FFCA concepts

In this section the notion of similarity between FFCAconcepts is introduced. We have seen that a concept inFFCA is a pair formed by a fuzzy set of objects and a setof attributes. FFCA concept similarity is computed byseparately evaluating the similarity of the fuzzy sets ofobjects and the similarity of the sets of attributes and,successively, by combining them as shown below.

Regarding the similarity of concept extents, we needto introduce a few basic notions about fuzzy set theory.

Informally, given two fuzzy sets A,B, the fuzzy setintersection and fuzzy set union are defined by fuzzysets where the membership functions are the minimumand maximum of the membership functions of A andB,

respectively. Furthermore, the cardinality of a fuzzy setA in X, denoted as | A |, is given by the sum of thegrades of membership of all the elements of X definedin A. For instance, the cardinality of the fuzzy set A:

A = ((H1, 1.0), (H2, 0.5), (H3, 0.5), (H5, 1.0))

is | A |= 3.Finally, the similarity of two fuzzy sets A and B, here

referred to as SetSim(A, B), is given by the cardinalityof the fuzzy set intersection divided by the cardinality ofthe fuzzy set union of A and B (Tho et al. 2006; Park etal. 2008). For instance, consider the fuzzy set A aboveand the fuzzy set B = ((H2,1.0),(H4,1.0), (H6,1.0)). Thesimilarity SetSim(A, B) is defined as follows:

SetSim(A, B) = | A ∩ B || A ∪ B | = | (H2, 0.5) |

| (H1, 1.0), (H2, 1.0), (H3, 0.5), (H4, 1.0), (H5, 1.0), (H6, 1.0) |= 0.5

5.5= 0.09

With regard to concept intents, we have seen theyare represented as sets of attributes. The comparisonbetween concept intents presented below has been in-spired by the maximum weighted matching problem inbipartite graphs (Formica 2008). Informally, considera lexical database for the English language, and twosets of attributes I1,I2 of two fuzzy concepts that do notnecessarily belong to the same context. Let a candidateset of pairs be a subset of I1 × I2 such that thereare no two pairs in the set sharing an element. Forinstance, assume that I1 and I2 represent a set of boysand a set of girls, respectively, a candidate set of pairsdefines a possible set of marriages (when polygamyis not allowed). Within all possible candidate sets ofpairs, consider (one of) the set(s) such that the sum ofthe values of the information content similarity (ics) ofthe pairs of attributes is maximal. Such a sum will beindicated as M(I1, I2).

For instance in our running example, assume I1 ={SwPool, Sea}, and I2 = {SwPool, Meal}. Within allpossible sets of pairs of attributes that can be formedwith I1 and I2 as described above, the set of pairs withmaximal sum is the following:

{(SwPool, SwPool), (Sea, Meal)},

because

ics(SwPool, SwPool) = 1,

and

ics(Sea, Meal) = 0. Therefore :

M((SwPool, Sea), (SwPool, Meal)) = 1,

whereas the other possible set of pairs:

{(SwPool, Meal), (Sea, SwPool)}leads to a null value (the ics of both the pairs are null).

Below the notion of similarity between FFCA con-cepts, referred to as ConSim, is presented. It is es-sentially given by the weighted average between thefuzzy set similarity of the concept extents SetSim, andthe maximal sum M(I1, I2) above (up to a normaliza-tion factor). Formally, given two fuzzy concepts C1 =(E1, I1), C2 = (E2, I2):

ConSim(C1, C2) = SetSim(E1, E2) ∗ w + M(I1, I2)

m∗(1 − w)

where SetSim is the fuzzy set similarity, M(I1, I2) isdefined as above and m is the greatest between thecardinalities of the sets I1,I2. Finally w is a weight,such that 0 ≤ w ≤ 1, which is established by the domainexpert, according to the characteristics of the addresseddomain.

It is easy to verify that ConSim is reflexive andsymmetric, whereas triangular inequality does not holdfor it. In fact, ConSim relies on the notion of infor-mation content and, as extensively investigated in Lin

516 Inf Syst Front (2013) 15:511–520

(1998), the semantic similarity measures based on theinformation content approach do not satisfy triangularinequality. The inadequacy of triangular inequality forevaluating semantic similarity is also stated by Tverskyin Tversky (1977), and is essentially illustrated by thefollowing example: “Jamaica is similar to Cuba (be-cause of their geographical proximity), Cuba is simi-lar to Russia (because of their political affinity), butJamaica and Russia are not similar at all”.

We observe that, in line with the work for non-fuzzy concepts given in Formica (2008), the informationcontent approach leads to a fundamental differencewith respect to other proposals. This point, that will alsobe discussed in the Related Work Section, is illustratedby the following example.

In our running example assume w = 12 , and consider

the concept:

C1 = (((H4, 1.0), (H6, 0.8)), (SwPool, Sea))

of the Fuzzy Concept Lattice of Fig. 2. Consider nowthe fuzzy concept C2 below, belonging to a differentcontext, say Italian Hotels, containing the object H7and the attribute Lake do not belonging to the contextSardinia Hotels:

C2 =(((H2, 1.0), (H4, 0.7), (H7, 0.6)), (SwPool, Lake))

The similarity between C1 and C2 is computed as fol-lows:

SetSim(((H2, 1.0), (H4, 0.7), (H7, 0.6)),

((H4, 1.0), (H6, 0.8))) = 0.7

3.4= 0.20

We have seen that ics(Sea, Lake) = 0.67, therefore:

M((SwPool, Sea), (SwPool, Lake)) = 1.67

and (m = 2):

ConSim(C1, C2) = 1

2∗ 0.20 + 1

2∗ 1.67

2= 0.52.

As a result, the information content approach al-lows us to automatically evaluate the similarity degreebetween Sea and Lake. Conversely, the analysis per-formed by a panel of experts in the given applicationdomain is needed, establishing axiomatic similarity de-grees for attribute pairs. In this proposal, as also inFormica (2008, 2010), the human expertise has beenreplaced by the notion of ics that makes use of lexicaldatabases for the English language available on theInternet (in this example WordNet).

Note that, with regard to time complexity, the sim-ilarity between concept intents is computed accordingto the Hungarian algorithm which solves the maximum

weighted matching problem in bipartite graphs in poly-nomial time (Kuhn 1955).

6 Related work

In this section, in order to provide an overview ofthe research for Semantic Web development, first webriefly analyze how FCA, fuzzy theory, and RoughSet theory have been employed in this area. Then,in the next subsection, we will focus on the similaritymeasures for the development of the Semantic Web.

In Semantic Web area, FCA (without fuzziness) hasbeen employed in several works in the literature. Forinstance, in Kim and Park (2001), Web search has beenperformed by using dynamic keyword suggestion in thecontext of FCA. Regarding ontology building, mappingand alignment supported by FCA, the works presentedin Stumme and Maedche (2001) and Hwang et al.(2005) are just two examples about the rich literaturethat has been defined in the last ten years. A fur-ther example can be found in Beydoun (2009), whereFCA has been used as a knowledge acquisition frame-work within an e-learning community in an AmericanUniversity.

The importance of employing fuzzy theory for Se-mantic Web search and development, independentlyof FCA, has been emphasized in several works. Forinstance, in Zhang et al. (2010), a formal approach andan automatic tool for building fuzzy ontologies fromfuzzy Object-Oriented database models have been pre-sented. Several research efforts are emerging for rep-resenting and reasoning with vague and uncertaintyinformation in the Semantic Web by addressing De-scription Logics (Lukasiewicz and Straccia 2008). InSimou et al. (2008), a novel framework for queryingfuzzy knowledge bases for Semantic Web developmenthas been proposed. The work is based on the inte-gration of a fuzzy Description Logic reasoner withRDF triples accommodating fuzzy elements. Finally,in Buche et al. (2006), an approach for building anXML data warehouse from semantically tagged dataextracted from the Web has been proposed. Both tag-ging and querying are fuzzy: the former because tagsare weighted, the latter because the user can expresspreferences by means of fuzzy criteria.

Rough Set theory has been employed in severalworks for the development of the Semantic Web. Forinstance in Ali and Beg (2009), an architecture of a Websearch system has been presented using a rough setbased rank aggregation technique, where ranking rulesare learnt on the basis of user feedbacks. In Dohertyet al. (2003), a formal framework for generating

Inf Syst Front (2013) 15:511–520 517

approximate concepts and ontologies from traditionalcrisp ontologies has been presented. In Jiang et al.(2009), uncertain knowledge representation and rea-soning in description logics have been addressed, anda new rough description logic based on approximateconcepts has been proposed.

6.1 Similarity measures for the semantic web

Concept similarity has been extensively investigatedin the literature in different research areas, such asNatural Language Processing, Information Retrieval,Machine Learning, etc., and recently in the SemanticWeb. Similarity measures for the development of theSemantic Web have been proposed in the literaturewith different scopes and by following different ap-proaches. For instance in Park et al. (2008), a com-bined measure for Web usage mining has been definedin order to extract patterns and trends in Web usersbehaviors. In the mentioned paper, although with adifferent purpose, a fuzzy similarity notion has beengiven corresponding to our SetSim measure definedabove. Concept similarity in FCA has been addressedin Formica (2006), by relying on human domain exper-tise, and in Formica (2008), according to the informa-tion content approach although, in both cases, withoutfuzzy values. Again, in Wang and Liu (2008), Zhaoet al. (2007), interesting proposals for the definition ofsimilarity measures for the Semantic Web have beenpresented, combining Rough Set theory and FCA with-out fuzzy values.

An interesting proposal concerning similarity inFuzzy Concept Lattices has been defined in Belohláveket al. (2008). It originates mainly to solve the problemrelated to the large number of concepts that can be ex-tracted from data in a Fuzzy Concept Lattice. In partic-ular, in the mentioned paper, both concept extents andintents are fuzzy and concept similarity is established onthe basis of object similarity or (in an equivalent way)attribute similarity. As we will show in the next section,the conditions for assessing concept similarity requiredin Belohlávek et al. (2008) are strong due to the scopeof the work, i.e., the need of simplifying the structureof a concept lattice by considering collections of similarconcepts. Here, in our work, the proposed similaritymeasure is intended to support the development ofthe Semantic Web and, in particular, activities such asontology mapping, integration, etc. where, in general,the intensional components of concepts are definedindependently of the extensional ones (Bai and Liu2008).

Finally, concept similarity has been addressed in Thoet al. (2006), where a framework for automatic gener-

ation of fuzzy ontologies from uncertainty informationhas been defined, named FOGA. Similar to Park et al.(2008), in FOGA the notion of fuzzy formal conceptsimilarity corresponds essentially to the similarity of theconcept extents, SetSim, defined in this paper.

In the next section, the above mentioned proposalswill be evaluated and compared with the similaritymeasure ConSim defined in this paper.

7 Evaluation of the method and discussion

In Table 3 some representative papers for the develop-ment of the Semantic Web have been selected. In thefirst row, FFCA, FCA, FT, and RST stand for FuzzyFormal Concept Analysis, Formal Concept Analysis,Fuzzy Theory, and Rough Set Theory, respectively. Thesimilarity column has been inserted to point out thatthe majority of these papers do not address similaritymeasures. As mentioned above, they concern for in-stance, the performances of Web search Engines basedon Rough Set Theory (Ali and Beg 2009), the problemof ontology merging and/or ontology building, with thesupport of RST or FCA (Doherty et al. 2003; Hwanget al. 2005, etc.), the development of Description Logic

Table 3 Some representative proposals for Semantic Webdevelopment

FFCA FCA FT RST similarity

Ali and Beg (2009) xBelohlávek

et al. (2008) x xBeydoun (2009) xBuche et al. (2006) xDoherty

et al. (2003) xFormica

(2006; 2008) x xFormica (2011) x xHwang et al. (2005) xJiang et al. (2009) xKim and

Park (2001) xLukasiewicz and

Straccia (2008) xPark et al. (2008) x xSimou et al. (2008) xStumme and

Maedche (2001) xTho et al. (2006) x xWang and

Liu (2008) x x xZhang et al. (2010) xZhao et al. (2007) x x xThis proposal x x

518 Inf Syst Front (2013) 15:511–520

reasoners (Lukasiewicz and Straccia 2008; Jiang et al.2009), and so on.

By focusing on the papers of Table 3 addressingsimilarity measures and FFCA, we have to restrict theattention to Belohlávek et al. (2008), Tho et al. (2006)and this proposal. In fact, by examining the remainingpapers dealing with similarity, in Formica (2006, 2008)no fuzzy values are considered, and the same holds inthe case of Wang and Liu (2008) and Zhao et al. (2007).Finally in Park et al. (2008), although with a completelydifferent scope, a fuzzy similarity measure is addressedwhich corresponds to SetSim as defined in this paper.For this reason, in the following, an explicit comparisonwith Park et al. (2008) is missing because the authorsdo not address FFCA, and the similarity measure theydefine behaves similar to Tho et al. (2006), which isillustrated below.

In order to compare the proposals, let us analyze thesimilarity values computed according to Belohlávek etal. (2008) and Tho et al. (2006) in the case of the pairof concepts C1, C2 in our running example, that arerecalled below:

C1 = (((H4, 1.0),(H6, 0.8)),(SwPool, Sea))

C2 = (((H2, 1.0),(H4, 0.7),(H7, 0.6)),(SwPool, Lake)).

In the Related Work Section we have anticipatedthat the conditions for evaluating concept similarity inBelohlávek et al. (2008) are strong due to the scopeof the work, i.e., the need of simplifying the structureof a concept lattice. In particular, concept similarityis established on the basis of object similarity or (inan equivalent way) attribute similarity. In essence, twoobjects g1, g2 are similar if they cannot be separated byany concept, i.e., for each concept c, g1 belongs to theextent of c iff g2 belongs to the extent of c. It is easyto see that, according to this approach, the concepts C1,C2 above are not similar.

In Tho et al. (2006), concept similarity is evaluatedby focusing on concept extents, disregarding the simi-larity of concept intents. Therefore in the case of theconcepts C1, C2 in our running example, their similarityis computed according to our SetSim as follows:

SetSim(((H4, 1.0), (H6, 0.8)), ((H2, 1.0),

(H4, 0.7), (H7, 0.6)))

= 0.7

3.4= 0.20

Before comparing these two proposals with ConSim,and for the sake of completeness, it is worth recallingthat in the literature there is a significant amount ofwork concerning similarity measures between featurevectors, independently of fuzzy sets. Such measures can

be used for evaluating the similarity between conceptintents, disregarding concept extents. Therefore, in ourexperiment some representative proposals have beenselected, and in particular the Dice, Jaccard, Cosine(Maarek et al. 1991), and WeightedSum (Castano et al.1998) measures, that are briefly recalled below.

Assume that I1, I2 are the intents (feature vectors)of two concepts C1, C2 respectively. The similaritybetween I1, I2, according to the selected proposals, isdefined as follows:

Dice(I1, I2) = 2| I1 ∩ I2 |

| I1 | + | I2 |Jaccard(I1, I2) = | I1 ∩ I2 |

| I1 ∪ I2 |Cosine(I1, I2) = | I1 ∩ I2 |

| I1 | ∗ | I2 |

WeightedSum(I1, I2) = 2�Af f (I1i , I2 j)

| I1 | + | I2 |where Af f(I1i, I2 j) is the affinity between the attributesI1i and I2 j , that is defined as follows:

Af f (I1i ,I2 j)=

⎧⎪⎪⎨

⎪⎪⎩

1 if I1i =I2 j

0.5 if I1i , I2 j

are related in the ISA hierarchy0 otherwise

and, similar to our approach, each feature can partici-pate in at most one affinity pair.

Note that, according to the measures above, similar-ity is established on the basis of common attributes (i.e.,according to the intersection of the sets of attributes).In the case of WeightedSum, attributes that are broaderor narrower concepts in the ISA hierarchy are also ad-dressed. For instance, in our running example, considerthe intents:

I1 = (SwPool, Sea)

I2 = (SwPool, Lake)

All the above measures agree on the similarity val-ues of the pair (Sea,Lake). In fact, for this pair thesimilarity value is zero even in the case of Weighted-Sum because Sea and Lake are not related in the ISAhierarchy (see Fig. 3). Viceversa, in the case of ConSimwe have seen that concept attributes are compared byrelying on Lin’s approach, and the similarity betweenthe attributes Sea and Lake is defined according tothe information content similarity (ics) as follows (seeSection 4):

ics(Sea, Lake) = 0.67

Inf Syst Front (2013) 15:511–520 519

Note that Lin’s approach has been adopted becauseit shows a higher correlation with human judgementthan other methods for evaluating concept similaritywithin a taxonomy, such as Resnik, Wu&Palmer, etc.,and the traditional time-honored method referred to asthe edge-counting approach (Lin 1998).

In Table 4 we have summarized the similarity valuesfor the pair of concepts C1, C2 in our running exam-ple according to the approaches by Belohlávek et al.,and Tho et al., which address fuzzy values, and Dice,Jaccard, Cosine, and WeightedSum similarity measuresfor feature vectors. In the first row, ExtSim, IntSim,and Comb stand for extensional similarity, intensionalsimilarity, and combined similarity, respectively. Theyhave been introduced in order to distinguish themethods that allow us to evaluate the similarity ofconcept extents, concept intents, and a combinationof both, respectively. As shown in the table, althoughthe approach proposed by Belohlávek et al. has beenconceived in order to compute the similarity of bothconcept components, in this case IntSim is not com-putable because in our approach concept intents arenot fuzzy. As a result, in all the listed proposals, exceptfor ConSim, the similarity of only one between thetwo concept components can be evaluated and no com-bination of them is proposed. According to ConSim,a weighted average of ExtSim, IntSim is computedwhere, as a further benefit of our proposal, the para-meter w, 0 ≤ w ≤ 1, can be tuned depending on theapplication domain. In particular, it allows us to es-tablish the relevance of the extensional with respect tothe intensional components in the specific domain. Theparameter w is useful, for instance, in the problem ofontology integration, which is a core issue of SemanticWeb research.

For instance, in the case of the concepts C1, C2, wehave:

ConSim(C1, C2) = w ∗ 0.20 + (1 − w) ∗ 0.67

and suppose we have to construct a global ontologyfor winter holidays in Italy, by integrating a set of

Table 4 Similarity between C1, C2

ExtSim IntSim Comb

Belohlávek et al. (2008) 0 – –Tho et al. (2006) 0.20 – –Dice – 0.50 –Jaccard – 0.25 –Cosine – 0.25 –WeightedSum – 0.50 –ConSim 0.20 0.67 w ∗ ExtSim

+ (1 − w) ∗ IntSim

overlapping local ontologies about Italian tourism. Thegoal is therefore to construct the integrated global on-tology for winter holidays by extracting and unifyinginformation from the local ontologies, as for instance,those related to the Sardinia Hotels and Italian Hotelscontexts in our running example. In the comparison ofthe concepts C1, C2, belonging to different contexts,the parameter w allows the domain expert to giveless relevance to the comparison of concept intents,rather than concept extents, because they involve theattributes Sea, Lake, and SwPool that, in general, arenot considered mandatory for a winter holiday.

The experiment we performed was articulated ac-cording to the following steps: (i) identification of afragment of the WordNet ISA hierarchy containingabout 50 concept nouns (attributes) from the tourismdomain; (ii) construction and identification of a frag-ment of a Fuzzy Concept Lattice (about 70 concepts),whose concept intents overlap with the nouns of theselected ISA hierarchy; (iii) identification of a selectedgroup of 15 students taught to deal with the FFCAframework; (iv) identification of 18 pairs of fuzzy con-cepts from the Fuzzy Concept Lattice, and submissionof this set of pairs to the students asking them, foreach pair, to specify a number in the interval [0,1]assessing their similarity degree (human judgment); (v)evaluation of the similarity values (ExtSim, IntSim,and Comb with w = 0.5) of each pair of fuzzy conceptsaccording to the proposals listed in Table 4; (vi) evalu-ation of the correlation with human judgment.

The results of our experiment have shown, as ex-pected, a correlation of ConSim with human judgmenthigher than the other mentioned proposals, with anaverage increment of about 0.3. The reasons underlyingthe strong imbalance in favor of our approach are dueto the inherently different scopes and assumptions ofthe proposals, as explained above and in the RelatedWork Section.

8 Conclusion

In this paper a measure for evaluating the similarityof FFCA concepts is proposed, that is an extension ofa previous proposal of the author for computing thesimilarity between non-fuzzy FCA concepts (Formica2008). The contribution of this proposal consists in eval-uating FFCA concept similarity as a combination of thesimilarity of concept extents (fuzzy sets) and conceptintents. In particular, concept intents are comparedaccording to the information content approach, whichallows a higher correlation with human judgement thantraditional approaches (Lin 1998). The proposal has

520 Inf Syst Front (2013) 15:511–520

been evaluated, experimented, and compared with theexisting literature, as shown in the last two sections.

References

Ali, R., & Beg, S. (2009). Automatic performance evaluation ofweb search systems using rough set based rank aggregation.In Proc. of the f irst int. conf. on Intelligent Human ComputerInteraction (IHCI) (pp. 344–358).

Bai, L., & Liu, M. (2008). A fuzzy-set based semantic similaritymatching algorithm for web service. In Proc. of the IEEE int.conference on services computing (Vol. 2). IEEE ComputerSociety.

Belohlávek, R., Outrata, J., & Vychodil, V. (2008). Fast factor-ization by similarity of fuzzy concept lattices with hedges.International Journal of Foundations of Computer Science,19(2), 255–269.

Berners-Lee, T., et al. (2001). The semantic web. ScientificAmerican.

Beydoun, G. (2009). Formal concept analysis for an e-learning se-mantic web. Expert Systems with Applications, 36(8), 10952–10961.

Buche, P., Dibie-Barthlemy, J., Haemmerl, O., & Hignette, G.(2006). Fuzzy semantic tagging and flexible querying ofXML documents extracted from the web. Journal of Intel-ligent Information Systems, 26(1), 25–40.

Castano, S., De Antonellis, V., Fugini, M. G., & Pernici, B.(1998). Conceptual schema analysis: Techniques and ap-plications. ACM Transactions on Database Systems, 23(3),286–333.

Doherty, P., Grabowski, M., Lukaszewicz, W., & Szalas, A.(2003). Towards a framework for approximate ontologies.Fundamenta Informaticae, 57, 147–165.

Fellbaum, C. (1998). Semantic network of english: The mother ofall wordNets. Computers and the Humanities, 32, 209–220.

Formica, A. (2006). Ontology-based concept similarity in formalconcept analysis. Information Sciences, 176(18), 2624–2641.

Formica, A. (2008). Concept similarity in formal concept analysis:An information content approach. Knowledge-Based Sys-tems, 21(1), 80–87.

Formica, A. (2010). Concept similarity in fuzzy formal con-cept analysis for semantic web. International Journal of Un-certainty, Fuzziness And Knowledge-Based Systems, 18(2),153–167.

Formica, A. (2012). Semantic web search based on rough sets andfuzzy formal concept analysis. Knowledge-Based Systems, 26,40–47.

Hwang, S., Kim, H., & Yang, H. (2005). A FCA-based ontologyconstruction for the design of class hierarchy. Computationalscience and its applications (ICCSA), LNCS (Vol. 3842,pp. 827–835).

Jiang, Y., Wang, J., Tang, S., & Xiao, B. (2009). Reasoning withrough description logic: An approximate concepts approach.Information Sciences, 179, 600–612.

Kim, B., & Park, Y. (2001). Web search using dynamic keywordsuggestion based on formal concept analysis. In W. Smari(Ed.), Proc. of the ISCA 3rd Int. Conf. on Information Reuseand Integration (IRI) (pp. 108–114), 27–29 November 2001.Las Vegas, Nevada, USA.

Kuhn, H. W. (1955). The Hungarian method for the assignmentproblem. Naval Research Logistics Quarterly, 2, 83–97.

Lin, D. (1998). An information-theoretic definition of similarity.In Proc. of the int. conference on machine learning (pp. 296–304). Madison: Morgan Kaufmann.

Lukasiewicz, T., & Straccia, U. (2008). Managing uncertainty andvagueness in description logics for the semantic web. Journalof Web Semantics, 6(4), 291–308.

Maarek, Y. S., Berry, D. M., & Kaiser, G. E. (1991). An informa-tion retrieval approach for automatically constructing soft-ware libraries. IEEE Transactions on Software Engineering,17(8), 800–813.

Park, S., Suresh, N. C., & Jeong, B. (2008). Sequence-based clus-tering for web usage mining: A new experimental frameworkand ANN-enhanced K-means algorithm. Data & KnowledgeEngineering, 65(3), 512–543.

Resnik, P. (1995). Using information content to evaluate se-mantic similarity in a taxonomy. In Proc. of the int. jointconference on artif icial intelligence, (pp. 448–453). Montreal:Morgan Kaufmann. Accessed 20–25 August 1995.

Simou, N., Stoilos, G., Tzouvaras, V., Stamou, G. B., & Kollias,S. D. (2008). Storing and querying fuzzy knowledge in thesemantic web. ISWC Workshop Uncertainty Reasoning forthe Semantic Web (URSW), Karlsruhe, Germany.

Stumme, G., & Maedche, A. (2001). FCA-MERGE: Bottom-up merging of ontologies. In Proc. of International JointConference on Artif icial Intelligence (IJCAI), (pp. 225–234).Seattle, USA.

Tho, Q. T., Hui, S. C., Cheuk, A., Fong, M., & Cao, T. H.(2006). Automatic fuzzy ontology generation for semanticweb. IEEE Transactions on Knowledge and Data Engineer-ing, 18(6), 842–856.

Tversky, A. (1977). Features of similarity. Psychological Review,84(4), 327–352.

Wang, L., & Liu, X. (2008). A new model of evaluating conceptsimilarity. Knowledge-Based Systems, 21(8), 842–846.

Wille, R. (1982). Restructuring lattice theory: An approach basedon hierarchies of concepts. In I. Rival (Ed.), Proc. of orderedsets (pp. 445–470). Dordrecht: Reidel.

WordNet: A lexical database for the English language.http://www.cogsci.princeton.edu/.

Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8, 338–353.

Zhang, F., Ma, Z. M., Fan, G., & Wang, X. (2010). Automaticfuzzy semantic web ontology learning from fuzzy object-oriented database model. Database and Expert Systems Ap-plications (DEXA), LNCS (Vol. 6261, pp. 16–30).

Zhao, Y., Halang, W., & Wang, X. (2007). Rough ontology map-ping in E-business integration. Studies in Computational In-telligence, 37, 75–93.

Anna Formica received the full-honors degree in Mathematicsfrom the University of Rome “La Sapienza” in 1989. Currently,she is a researcher at the “Istituto di Analisi dei Sistemi edInformatica” (IASI) “Antonio Ruberti” of the Italian NationalResearch Council (Consiglio Nazionale delle Ricerche—CNR),in Rome, where she works within the Information Systems andKnowledge Bases group, and the Laboratory for EnterpriseKnowledge and Systems (LEKS). She serves as referee of sev-eral international journals and conferences and she took part invarious research projects of the European Framework Programsand bilateral projects with international institutions. Her currentresearch interests are the semantic web, similarity reasoning,formal specification and validation of domain ontologies.