
Page 1: Information Retrieval using Semantic Similarity

Seminar on Artificial Intelligence

Information Retrieval Using Semantic Similarity

Harshita Meena (100050020)
Diksha Meghwal (100050039)

Saswat Padhi (100050061)

Page 2: Information Retrieval using Semantic Similarity

Overview ...

● “Semantics” & “Ontology” (Diksha)

● What is IR lacking?
● Semantics: What? And how?
● Ontologies and knowledge representation

● Semantic Similarity (Harshita)

● Semantic similarity: What? And how?
● Path based semantic similarity measures
● Information content based similarity measures

● Information Retrieval (Saswat)

● VSM revisited
● SSRM: IR with semantics
● Conclusion and further reading

Page 3: Information Retrieval using Semantic Similarity

“Semantics” & “Ontology”

What is IR (without semantics) lacking?

“MEANING”

Query: software
Pool: application, program, package, freeware, shareware
Result: no match!

This is the motivation for looking at semantic rather than lexical similarity.

The problem in information retrieval today is not a lack of data, but the lack of "structured" and "meaningful" organisation of data.

Ontologies are attempts to organise information and empower IR.

Page 4: Information Retrieval using Semantic Similarity

“Semantics” & “Ontology”

Semantics: What? And How?

"Semantics" captures the meaning of linguistic terms. Computers do not understand "meaning", so the semantics of a term are instead represented through its links to other terms.

An “ontology” formally represents knowledge as a set of concepts within a domain, and the relationships between pairs of concepts. It can be used to model a domain and support reasoning about entities.

Formal definition by Tom Gruber: "An ontology is a formal, explicit specification of a shared conceptualization."

● formal: it should be machine readable
● explicit: the types of concepts and the constraints on them are explicitly defined
● shared: the ontology is agreed upon and accepted by a group
● conceptualization: an abstract model that consists of the relevant concepts and the relationships between them

Page 5: Information Retrieval using Semantic Similarity

“Semantics” & “Ontology”

Components of Ontologies

● Classes: Abstract groups, or collections of objects. They may contain individuals, other classes, or a combination of both. Classes can be extensional or intensional, and can subsume or be subsumed.

● Attributes: Store information that is specific to the object they are attached to, such as its features or characteristics.

● Relationships: A relation is an attribute whose value is another object in the ontology, e.g. subsumption relations (is-superclass-of, the converse of is-a, is-subtype-of or is-subclass-of) and meronymy relations (part-of).

● A domain ontology (or domain-specific ontology) models a specific domain, or part of the world.

● An upper ontology (or foundation ontology) models the common objects that are applicable across a range of domain ontologies.

A toy sketch of these components follows below.
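To make these components concrete, here is a minimal sketch in Python of a tiny hand-built ontology; all concept names, attributes and links here are invented for illustration, not taken from any real ontology:

import json

# A toy ontology: classes, attributes, and relationships.
# Every name below is a hypothetical example.
ontology = {
    "software": {
        "is-a": [],                       # root concept
        "attributes": {"definition": "programs and data run by a computer"},
    },
    "application": {
        "is-a": ["software"],             # subsumption: application is-a software
        "attributes": {"definition": "software built for end users"},
    },
    "freeware": {
        "is-a": ["application"],
        "attributes": {"definition": "application distributed at no cost"},
        "part-of": [],                    # meronymy links would go here
    },
}

def superclasses(concept):
    """Walk is-a links up to the root (transitive subsumption)."""
    out, stack = [], list(ontology[concept]["is-a"])
    while stack:
        parent = stack.pop()
        out.append(parent)
        stack.extend(ontology[parent]["is-a"])
    return out

print(superclasses("freeware"))   # ['application', 'software']
print(json.dumps(ontology["freeware"], indent=2))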

Page 6: Information Retrieval using Semantic Similarity

“Semantics” & “Ontology”

Examples of Popular Ontologies

Medical Subject Headings (MeSH)

MeSH is a comprehensive controlled vocabulary for the purpose of indexing journal articles and books in the life sciences; it can also serve as a thesaurus that facilitates searching. Created and updated by the United States National Library of Medicine (NLM), it is used by the MEDLINE/PubMed article database and by NLM's catalog of book holdings.

WordNet

WordNet is a lexical database for the English language, which superficially resembles a thesaurus. It groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets.

Page 7: Information Retrieval using Semantic Similarity

“Semantics” & “Ontology”

The Future: "Semantic Web", OWL and RDF ...

The Semantic Web is a collaborative movement led by the international standards body W3C. It aims at converting the current web, dominated by semi-structured documents, into an organised "web of data".

RDF (Resource Description Framework) is part of the W3C family of specifications and can be used as a general method for the conceptual description or modeling of information.

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Tony_Benn">
    <dc:title>Tony Benn</dc:title>
    <dc:publisher>Wikipedia</dc:publisher>
  </rdf:Description>
</rdf:RDF>

OWL is built on top of RDF; it is more expressive and supports greater machine interpretability than RDF alone.

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <owl:Ontology rdf:about="http://www.linkeddatatools.com/plants">
    <dc:title>The LinkedDataTools.com Example Plant Ontology</dc:title>
    <dc:description>An example ontology</dc:description>
  </owl:Ontology>
  <owl:Class rdf:about="http://www.linkeddatatools.com/plants#planttype">
    <rdfs:label>The plant type</rdfs:label>
    <rdfs:comment>The class of plant types.</rdfs:comment>
  </owl:Class>
</rdf:RDF>

Page 8: Information Retrieval using Semantic Similarity

Semantic Similarity

An ontology by itself is just a "structure", without any weights on the edges.

Semantic similarity measures exploit the structure information and try to quantify the concept similarities in a given ontology.

Ontology based semantic measures can be classified as follows:

● Path Based Similarity Measures
Path based measures use the length of the shortest path between two concepts, their generality or specificity, and their relationships with other concepts.

● Information Content Based Similarity Measures
Information content based measures associate with each concept a quantity IC that reflects the probability of the concept in the ontology.

● Feature Based Similarity Measures (which we won't discuss)

Page 9: Information Retrieval using Semantic Similarity

Semantic Similarity (Path Based)

Wu & Palmer Measure:

The Wu & Palmer measure fits the intuition that concepts at greater depth are more similar (because of their specificity). N_1 and N_2 are the numbers of IS-A links from C_1 and C_2 respectively to their most specific common subsumer C, and H is the number of IS-A links from C to the root of the ontology.

sim_{W&P}(C_1, C_2) = \frac{2H}{N_1 + N_2 + 2H}

Li Measure:

The Li measure combines the shortest path and the depth information of the ontology in a non-linear function. L stands for the length of the shortest path between the two concepts, α and β are scaling factors, and H is the same as in the Wu & Palmer measure. Both measures are sketched in code below.

sim_{Li}(C_1, C_2) = e^{-\alpha L} \cdot \frac{e^{\beta H} - e^{-\beta H}}{e^{\beta H} + e^{-\beta H}}
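A minimal sketch of both measures over a toy IS-A hierarchy; the concept names are hypothetical, and the α = 0.2, β = 0.6 defaults are illustrative choices rather than prescribed values:

import math

# Toy IS-A hierarchy (child -> parent); all names are hypothetical.
PARENT = {
    "software": None,
    "application": "software",
    "program": "software",
    "freeware": "application",
    "shareware": "application",
}

def path_to_root(c):
    """Concepts from c up to the root, following IS-A links."""
    path = [c]
    while PARENT[c] is not None:
        c = PARENT[c]
        path.append(c)
    return path

def lcs_counts(c1, c2):
    """Most specific common subsumer C, with N1, N2 and H (IS-A link counts)."""
    p1, p2 = path_to_root(c1), path_to_root(c2)
    pos2 = {c: i for i, c in enumerate(p2)}
    for n1, c in enumerate(p1):
        if c in pos2:
            h = len(path_to_root(c)) - 1   # links from C to the root
            return c, n1, pos2[c], h
    raise ValueError("concepts share no subsumer")

def sim_wu_palmer(c1, c2):
    _, n1, n2, h = lcs_counts(c1, c2)
    return 2 * h / (n1 + n2 + 2 * h)

def sim_li(c1, c2, alpha=0.2, beta=0.6):
    _, n1, n2, h = lcs_counts(c1, c2)
    L = n1 + n2                            # shortest path through the subsumer
    return math.exp(-alpha * L) * math.tanh(beta * h)  # tanh == the e^{βH} ratio

print(sim_wu_palmer("freeware", "shareware"))   # 0.5
print(sim_li("freeware", "shareware"))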

Page 10: Information Retrieval using Semantic Similarity

Semantic Similarity (Path Based)

Leacock & Chodorow Measure:

This is almost the same as the Wu & Palmer method, except for logarithmic smoothing and the removal of the depth factor from the denominator. As in the Li measure, L is the length of the shortest path between concepts C_1 and C_2, and H is the maximum depth of the ontology (the number of IS-A links from the deepest concept to the root).

sim_{L&C}(C_1, C_2) = -\log \frac{L}{2H}

Mao Measure:

The Mao measure considers the generality of the concepts by taking the number of their descendants into account. L stands for the length of the shortest path between the two concepts, d(C) is the number of descendants of C, and δ is a constant (usually chosen as 0.9). Both measures are sketched below.

sim_{Mao}(C_1, C_2) = \frac{\delta}{L \cdot \log_2 (1 + d(C_1) + d(C_2))}
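A sketch of the two measures under the same assumptions as before (hypothetical toy hierarchy, shortest-path length L supplied by the caller):

import math

# Same hypothetical IS-A hierarchy as in the earlier sketch.
PARENT = {"software": None, "application": "software", "program": "software",
          "freeware": "application", "shareware": "application"}

def depth(c):
    d = 0
    while PARENT[c] is not None:
        c, d = PARENT[c], d + 1
    return d

H = max(depth(c) for c in PARENT)           # maximum depth of the ontology

def descendants(c):
    """Number of concepts subsumed by c, i.e. d(C) in the Mao measure."""
    kids = [k for k, p in PARENT.items() if p == c]
    return len(kids) + sum(descendants(k) for k in kids)

def sim_leacock_chodorow(L):
    # L: shortest-path length between the two concepts (must be >= 1)
    return -math.log(L / (2 * H))

def sim_mao(c1, c2, L, delta=0.9):
    return delta / (L * math.log2(1 + descendants(c1) + descendants(c2)))

print(sim_leacock_chodorow(L=2))            # e.g. path freeware -> shareware
print(sim_mao("application", "program", L=2))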

Page 11: Information Retrieval using Semantic Similarity

Semantic Similarity (IC Based)

The intuition behind information content is that more frequent terms are more general and hence provide less "information":

IC(C) = -\log p(C) = -\log \frac{freq(C)}{freq(root)}

freq(C) is the frequency of concept C, and freq(root) is the frequency of the root concept of the ontology. The frequency of a concept includes the frequencies of the concepts it subsumes in the IS-A hierarchy.

We call concept C the most informative subsumer of two concepts C_1 and C_2, with information content IC_{mis}(C_1, C_2), if C has the least probability among all subsumers shared by the two concepts (and is thus the most informative).

Resnik Measure:

The more information two terms share, the more similar they are.

sim_{Resnik}(C_1, C_2) = IC_{mis}(C_1, C_2)
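A minimal sketch of IC and the Resnik measure over the same toy hierarchy; the per-concept corpus counts are hypothetical:

import math

PARENT = {"software": None, "application": "software", "program": "software",
          "freeware": "application", "shareware": "application"}
# Hypothetical per-concept corpus counts (not yet cumulative).
COUNTS = {"software": 50, "application": 30, "program": 25,
          "freeware": 10, "shareware": 5}

def freq(c):
    """freq(C): count of C plus the counts of every concept it subsumes."""
    return COUNTS[c] + sum(freq(k) for k, p in PARENT.items() if p == c)

ROOT = "software"

def ic(c):
    return -math.log(freq(c) / freq(ROOT))

def subsumers(c):
    out = {c}
    while PARENT[c] is not None:
        c = PARENT[c]
        out.add(c)
    return out

def sim_resnik(c1, c2):
    """IC of the most informative (least probable) shared subsumer."""
    return max(ic(c) for c in subsumers(c1) & subsumers(c2))

print(sim_resnik("freeware", "shareware"))   # IC of 'application'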

Page 12: Information Retrieval using Semantic Similarity

Semantic Similarity (IC Based)

Jiang Measure:

The Jiang measure considers the information content of each term in addition to the shared information content. It is an inverted (distance) measure: the distance between two concepts is the amount of information needed to fully describe both concepts, excluding the information that is common to both.

dist_{Jiang}(C_1, C_2) = IC(C_1) + IC(C_2) - 2 \cdot IC_{mis}(C_1, C_2)

Lin Measure:

The Lin measure also uses the information content of each term, but differently from Jiang: it takes a ratio instead of a difference. Since IC_{mis}(C_1, C_2) \le IC(C_1) and IC_{mis}(C_1, C_2) \le IC(C_2), the similarity value is normalized between 0 and 1 (with 1 for identical concepts).

sim_{Lin}(C_1, C_2) = \frac{2 \cdot IC_{mis}(C_1, C_2)}{IC(C_1) + IC(C_2)}
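Both measures reduce to a couple of lines once IC values are available; the numbers below are hypothetical stand-ins for ic(C1), ic(C2) and IC_mis(C1, C2) from the previous sketch:

# Hypothetical IC values for two concepts and their best shared subsumer.
IC1, IC2, IC_MIS = 2.3, 2.9, 1.6

def dist_jiang(ic1, ic2, ic_mis):
    """Information needed to describe both concepts, minus what they share."""
    return ic1 + ic2 - 2 * ic_mis

def sim_lin(ic1, ic2, ic_mis):
    """Ratio of shared to total information content; lies in [0, 1]."""
    return 2 * ic_mis / (ic1 + ic2)

print(dist_jiang(IC1, IC2, IC_MIS))   # 2.0 (smaller distance = more similar)
print(sim_lin(IC1, IC1, IC1))         # 1.0 for identical concepts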

Page 13: Information Retrieval using Semantic Similarity

Semantic Similarity

Correlation with human judgements

WordNet Ontology

Method        Type   Correlation
Wu & Palmer   Path   0.74
Li            Path   0.82
Leacock       Path   0.82
Resnik        IC     0.79
Lin           IC     0.82
Jiang         IC     0.83

MeSH Ontology

Method        Type   Correlation
Wu & Palmer   Path   0.67
Li            Path   0.70
Leacock       Path   0.74
Resnik        IC     0.71
Lin           IC     0.72
Jiang         IC     0.71

Page 14: Information Retrieval using Semantic Similarity

Information Retrieval

SSRM: IR with semantics ... (0/3)

VSM Revisited:
● Similarity in VSM is the cosine inner product:

sim(q, d) = \frac{\sum_i q_i d_i}{\sqrt{\sum_i q_i^2} \cdot \sqrt{\sum_i d_i^2}}

● Each dimension corresponds to a separate term; q and d are n-dimensional vectors with a weight for each term.
● q_i and d_i are the weights of the query and document terms.
● The document term weight is d_i = tf_i \cdot idf_i.
● Specifically, I will talk about the SSRM algorithm (Semantic Similarity Retrieval Model), where we modify the query term weights to take semantic similarity into account. A sketch of the baseline cosine follows below.
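For reference, a minimal sketch of the VSM cosine over sparse term-weight dictionaries; the vectors shown are hypothetical tf-idf weights:

import math

def cosine_sim(q, d):
    """Cosine similarity between sparse term-weight vectors (dicts)."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

# Hypothetical tf-idf weighted vectors:
q = {"free": 1.0, "software": 1.0}
d = {"software": 0.8, "program": 0.5, "download": 0.3}
print(cosine_sim(q, d))   # matches only on the literal term 'software'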

Page 15: Information Retrieval using Semantic Similarity

Information Retrieval

SSRM: IR with semantics ... (1/3)

Query Re-weighting:
● A query can contain related (semantically similar) terms:

Query: free scientific computing software

● We re-weight the query terms to stress the particular concept we are searching for (sketched below).
● q_i and q_i' are the old and new weights respectively; i and j refer to different terms in the query, and t is a similarity threshold.

q_i' = q_i + \sum_{sim(i,j) \ge t,\; i \ne j} q_j \cdot sim(i, j)
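A sketch of this re-weighting step, assuming a term-term similarity function sim(i, j); the similarity values below are hypothetical:

def reweight_query(q, sim, t=0.8):
    """SSRM query re-weighting sketch: each term's weight is boosted by
    semantically similar co-query terms.  q maps term -> weight."""
    return {
        i: qi + sum(qj * sim(i, j)
                    for j, qj in q.items()
                    if i != j and sim(i, j) >= t)
        for i, qi in q.items()
    }

# Hypothetical similarities over a tiny vocabulary:
SIM = {("software", "freeware"): 0.9, ("freeware", "software"): 0.9}
sim = lambda i, j: SIM.get((i, j), 0.0)
print(reweight_query({"software": 1.0, "freeware": 1.0}, sim))
# {'software': 1.9, 'freeware': 1.9}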

Page 16: Information Retrieval using Semantic Similarity

Information Retrieval

SSRM: IR with semantics ... (2/3)

Query Expansion:
● New terms might be semantically similar to the query terms. We "expand" the query by adding new terms from the neighbourhood of each query term in the ontology (see the sketch after the formula).
● Adding such terms also affects the weights of the existing terms.
● n is the number of hyponyms of each expanded term j, and T is a similarity threshold.

q_i' = \begin{cases} \sum_{sim(i,j) \ge T,\; i \ne j} \frac{q_j}{n} \cdot sim(i, j) & \text{if } i \text{ is a new term} \\ q_i + \sum_{sim(i,j) \ge T,\; i \ne j} \frac{q_j}{n} \cdot sim(i, j) & \text{if } i \text{ already had weight } q_i \end{cases}
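A sketch of the expansion step, assuming a hypothetical neighbours(j) function that returns the hyponyms of query term j in the ontology:

def expand_query(q, neighbours, sim, T=0.9):
    """SSRM query expansion sketch: each candidate i near a query term j
    receives weight contributions (q_j / n) * sim(i, j)."""
    new_q = dict(q)
    for j, qj in q.items():
        hypo = neighbours(j)
        n = max(len(hypo), 1)              # n: number of hyponyms of j
        for i in hypo:
            if i != j and sim(i, j) >= T:
                new_q[i] = new_q.get(i, 0.0) + (qj / n) * sim(i, j)
    return new_q

# Hypothetical neighbourhood and similarity:
neighbours = lambda j: ["freeware", "shareware"] if j == "software" else []
sim = lambda i, j: 0.95
print(expand_query({"software": 1.0}, neighbours, sim))
# {'software': 1.0, 'freeware': 0.475, 'shareware': 0.475}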

Page 17: Information Retrieval using Semantic Similarity

Information Retrieval

SSRM: IR with semantics ... (3/3)

Document Similarity:
● Once we have the expanded, re-weighted query vector and the tf-idf document vector, we calculate the query-document similarity between query q and document d as:

sim(q, d) = \frac{\sum_i \sum_j q_i \cdot d_j \cdot sim(i, j)}{\sum_i \sum_j q_i \cdot d_j}

● Properties:
● Symmetric.
● Normalized in [0, 1].
● Consistent behaviour.
● Can be easily tweaked for document-document similarity.
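A minimal sketch of this double summation over sparse vectors; the weights and similarity values are hypothetical:

def ssrm_sim(q, d, sim):
    """SSRM query-document similarity sketch: every query term is compared
    with every document term, weighted by semantic similarity."""
    num = sum(qi * dj * sim(i, j)
              for i, qi in q.items() for j, dj in d.items())
    den = sum(qi * dj for qi in q.values() for dj in d.values())
    return num / den if den else 0.0

sim = lambda i, j: 1.0 if i == j else 0.7   # hypothetical similarities
q = {"software": 1.0}
d = {"program": 0.6, "software": 0.4}
print(ssrm_sim(q, d, sim))   # 0.82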

Page 18: Information Retrieval using Semantic Similarity

Information Retrieval

SSRM: At a glance

Page 19: Information Retrieval using Semantic Similarity

Information Retrieval

SSRM Implementation Notes:
● Quadratic time complexity, as opposed to linear for VSM.
● The similarity between every pair of terms can be cached (hashed).
● It is expensive to expand and re-weight the document vectors as well, so we only re-weight and expand the queries. Expanding one of the two vectors should incorporate enough semantic information.
● The thresholds (t, T) need to be tuned for optimal behaviour.
● Although the behaviour of SSRM is consistent, it does not guarantee sim(d, d) = 1, i.e. even an exact search will not give a similarity value of 1.

● I had proposed the following formula last summer and the results on MeSH were quite satisfactory:

sim(q, d) = \frac{\sum_i \sum_j q_i \cdot d_j \cdot maxsim_i}{\sum_i \sum_j q_i \cdot d_j}, \quad \text{where } maxsim_i = \max_j sim(i, j)
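A sketch of this maxsim variant, reusing the hypothetical vectors from the previous sketch; note that an exact term match now scores 1:

def ssrm_maxsim(q, d, sim):
    """Variant sketch: each query term contributes its best match in the
    document (maxsim_i) instead of every pairwise similarity."""
    num = 0.0
    for i, qi in q.items():
        maxsim_i = max((sim(i, j) for j in d), default=0.0)
        num += sum(qi * dj * maxsim_i for dj in d.values())
    den = sum(qi * dj for qi in q.values() for dj in d.values())
    return num / den if den else 0.0

sim = lambda i, j: 1.0 if i == j else 0.7
q = {"software": 1.0}
d = {"program": 0.6, "software": 0.4}
print(ssrm_maxsim(q, d, sim))   # 1.0: the exact term match dominates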

Page 20: Information Retrieval using Semantic Similarity

Experimental Results

[Result plots: IR on OHSUMED using MeSH, and IR on the web using WordNet]

Page 21: Information Retrieval using Semantic Similarity

Future ...

Possible Issues
● Negation
● Query: "I like pizza" / Match: "I don't like pizza"
● Antonymy
● Query: "Slow runner" / Match: "Fast runner"
● Role Reversal
● Query: "Dog bites man" / Match: "Man bites dog"

Further reading
● Groupwise Semantic Similarity
● Jaccard Index
● simLP, simUI, simGIC
● Statistical Semantic Similarity
● LSA: Latent Semantic Analysis
● NGD: Normalized Google Distance
● PMI: Pointwise Mutual Information

Page 22: Information Retrieval using Semantic Similarity

References

● A Comparative Study of Ontology Based Term Similarity Measures on PubMed Document Clustering [Xiaodan Zhang, Liping Jing, Xiaohua Hu, Michael Ng, Xiaohua Zhou] [2007]

● Information Retrieval by Semantic Similarity [A. Hliaoutakis, G. Varelas, E. Voutsakis] [2006]

● Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy [Jay J. Jiang, David W. Conrath] [1997]

Thank you!