Chapter 2 Modeling
Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto
Introduction
Traditional information retrieval systems usually adopt index terms to index and retrieve documents. An index term is a keyword (or group of related words) which has some meaning of its own (usually a noun). Advantage:
Simple: the semantics of the documents and of the user information need can be naturally expressed through sets of index terms.
IR Models
Ranking algorithms are at the core of information retrieval systems (predicting which documents are relevant and which are not).
A taxonomy of information retrieval models

[Figure: taxonomy of IR models by user task. For the retrieval task (ad hoc and filtering): Classic models (Boolean, Vector, Probabilistic), Set Theoretic models (Fuzzy, Extended Boolean), Algebraic models (Generalized Vector, Latent Semantic Indexing, Neural Networks), Probabilistic models (Inference Network, Belief Network), and Structured models (Non-overlapping Lists, Proximal Nodes). For the browsing task: Flat, Structure Guided, and Hypertext models.]

Figure 2.2 Retrieval models most frequently associated with distinct combinations of a document logical view and a user task:

Task        | Index Terms                                      | Full Text                                        | Full Text + Structure
Retrieval   | Classic, Set Theoretic, Algebraic, Probabilistic | Classic, Set Theoretic, Algebraic, Probabilistic | Structured
Browsing    | Flat                                             | Flat, Hypertext                                  | Structure Guided, Hypertext
Retrieval : Ad hoc and Filtering
Ad hoc (Search): The documents in the collection remain relatively static while new queries are submitted to the system.
Routing (Filtering): The queries remain relatively static while new documents come into the system.
A formal characterization of IR models
D: A set composed of logical views (or representations) for the documents in the collection.
Q: A set composed of logical views (or representations) for the user information needs (queries).
F: A framework for modeling document representations, queries, and their relationships.
R(qi, dj): A ranking function which associates a real number with a query qi and a document dj, defining an ordering among the documents with regard to the query qi.
Define
ki: a generic index term
K: the set of all index terms {k1, …, kt}
wi,j: a weight associated with index term ki of a document dj
gi: a function that returns the weight associated with ki in any t-dimensional vector, i.e. $g_i(\vec{d}_j) = w_{i,j}$
Classic IR Model
Basic concepts: Each document is described by a set of representative keywords called index terms.
Since not all index terms are equally relevant, numerical weights are assigned to them to capture their distinct importance for describing each document.
Boolean model
Binary decision criterion: a document is predicted to be either relevant or non-relevant, with no notion of partial matching (closer to a data retrieval model).
Advantage: clean formalism, simplicity.
Disadvantages:
It is not simple to translate an information need into a Boolean expression.
Exact matching may lead to retrieval of too few or too many documents.
Example: the query $q = k_a \wedge (k_b \vee \neg k_c)$ can be represented as a disjunction of conjunctive vectors (in DNF):
$$\vec{q}_{dnf} = (1,1,1) \vee (1,1,0) \vee (1,0,0)$$
where each binary triple lists the weights of $(k_a, k_b, k_c)$.
Formal definition: for the Boolean model, the index term weights are all binary, i.e. $w_{i,j} \in \{0,1\}$. A query q is a conventional Boolean expression, which can be transformed into a disjunctive normal form $\vec{q}_{dnf}$ whose conjunctive components are denoted $\vec{q}_{cc}$. Then
$$sim(d_j, q) = \begin{cases} 1 & \text{if } \exists\, \vec{q}_{cc} \in \vec{q}_{dnf} \text{ such that } \forall k_i,\ g_i(\vec{d}_j) = g_i(\vec{q}_{cc}) \\ 0 & \text{otherwise} \end{cases}$$
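A minimal Python sketch of this matching rule over a toy three-term vocabulary; all data and names here are illustrative, not from the book:

```python
# Each document is a binary vector over the index terms (ka, kb, kc).
docs = {
    "d1": (1, 1, 1),
    "d2": (1, 1, 0),
    "d3": (0, 1, 0),
}

# q = ka AND (kb OR NOT kc), written in DNF as its conjunctive components:
q_dnf = [(1, 1, 1), (1, 1, 0), (1, 0, 0)]

def boolean_sim(doc_vec, dnf):
    """sim(dj, q) = 1 if some conjunctive component matches dj exactly."""
    return 1 if any(cc == doc_vec for cc in dnf) else 0

for name, vec in docs.items():
    print(name, boolean_sim(vec, q_dnf))   # d1 -> 1, d2 -> 1, d3 -> 0
```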
Vector model
Assign non-binary weights to index terms in queries and in documents. => TFxIDF
Compute the similarity between documents and query. => Sim(Dj, Q)
More precise than the Boolean model.
The IR problem as a clustering problem
We think of the documents as a collection C of objects and think of the user query as a specification of a set A of objects.
Intra-cluster: which features better describe the objects in the set A?
Inter-cluster: which features better distinguish the objects in the set A from the remaining objects in C?
TF: intra-cluster similarity is quantified by measuring the raw frequency of a term ki inside a document dj. Such term frequency is usually referred to as the tf factor and provides one measure of how well that term describes the document contents.
IDF: inter-cluster dissimilarity is quantified by measuring the inverse of the frequency of a term ki among the documents in the collection. This frequency is often referred to as the inverse document frequency.
Idea for TFxIDF
Vector Model (1/4)
Index terms are assigned positive and non-binary weights.
The index terms in the query are also weighted.
Term weights are used to compute the degree of similarity between documents and the user query. Then, retrieved documents are sorted in decreasing order.
$$\vec{d}_j = (w_{1,j}, w_{2,j}, \dots, w_{t,j})$$
$$\vec{q} = (w_{1,q}, w_{2,q}, \dots, w_{t,q})$$
Vector Model (2/4)
Degree of similarity:
$$sim(d_j, q) = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j| \times |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$$

[Figure 2.4: The cosine of θ, the angle between $\vec{d}_j$ and $\vec{q}$, is adopted as sim(dj, q).]
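A small sketch of this cosine formula in Python; NumPy is assumed to be available and the weight vectors are invented toy values:

```python
import numpy as np

def cosine_sim(d, q):
    """sim(dj, q) = (dj . q) / (|dj| |q|)."""
    denom = np.linalg.norm(d) * np.linalg.norm(q)
    return float(d @ q / denom) if denom else 0.0

d_j = np.array([0.5, 0.8, 0.0])   # (w_{1,j}, w_{2,j}, w_{3,j})
q   = np.array([0.4, 0.6, 0.1])   # (w_{1,q}, w_{2,q}, w_{3,q})
print(cosine_sim(d_j, q))          # ~0.99
```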
Vector Model (3/4) Definition
Normalized frequency:
$$f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}}$$
Inverse document frequency:
$$idf_i = \log \frac{N}{n_i}$$
Term-weighting scheme:
$$w_{i,j} = f_{i,j} \times \log \frac{N}{n_i}$$
Query-term weights:
$$w_{i,q} = \left(0.5 + \frac{0.5\, freq_{i,q}}{\max_l freq_{l,q}}\right) \times \log \frac{N}{n_i}$$
where $freq_{i,j}$ is the raw frequency of ki in dj, N is the total number of documents, and $n_i$ is the number of documents in which ki appears.
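A sketch of these weighting formulas in Python; the frequencies and collection statistics below are made up for illustration:

```python
import math

def doc_weight(freq_ij, max_freq_j, N, n_i):
    f_ij = freq_ij / max_freq_j          # normalized frequency f_{i,j}
    return f_ij * math.log(N / n_i)      # w_{i,j} = f_{i,j} * idf_i

def query_weight(freq_iq, max_freq_q, N, n_i):
    return (0.5 + 0.5 * freq_iq / max_freq_q) * math.log(N / n_i)

# Term ki occurs 3 times in dj (the most frequent term occurs 5 times),
# appears in 100 of N = 10000 documents, and occurs twice in the query
# (whose most frequent term also occurs twice).
print(doc_weight(3, 5, 10000, 100))    # 0.6 * ln(100) ~ 2.76
print(query_weight(2, 2, 10000, 100))  # 1.0 * ln(100) ~ 4.61
```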
Vector Model (4/4)
Advantages:
Its term-weighting scheme improves retrieval performance.
Its partial matching strategy allows retrieval of documents that approximate the query conditions.
Its cosine ranking formula sorts the documents according to their degree of similarity to the query.
Disadvantage:
The assumption of mutual independence between index terms.
Example: consider three items represented in two different bases:
v1: (1,0) → (1,0); v2: (1,1) → (0,1); v3: (0,1) → (-1,1)
Under the first (orthogonal) basis: cos(v1,v2) = 1/√2, cos(v2,v3) = 1/√2, cos(v1,v3) = 0.
Under the second basis: cos(v1,v2) = 0, cos(v2,v3) = 1/√2, cos(v1,v3) = -1/√2.
The similarities change with the choice of basis, so the assumed orthogonality (independence) of index terms matters.
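A quick numeric check of the cosines in this example (NumPy assumed):

```python
import numpy as np

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

basis1 = [np.array([1., 0.]), np.array([1., 1.]), np.array([0., 1.])]
basis2 = [np.array([1., 0.]), np.array([0., 1.]), np.array([-1., 1.])]

for v in (basis1, basis2):
    print([round(float(cos(v[i], v[j])), 3)
           for i, j in ((0, 1), (1, 2), (0, 2))])
# basis1: [0.707, 0.707, 0.0]      i.e. (1/sqrt(2), 1/sqrt(2), 0)
# basis2: [0.0, 0.707, -0.707]     i.e. (0, 1/sqrt(2), -1/sqrt(2))
```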
Probabilistic Model (1/6)
Introduced by Robertson and Sparck Jones, 1976. Also called the binary independence retrieval (BIR) model.
Idea: Given a user query q and the ideal answer set of the relevant documents, the problem is to specify the properties of this set; i.e., the probabilistic model tries to estimate the probability that the user will find the document dj relevant, via the ratio P(dj relevant to q) / P(dj non-relevant to q).
Probabilistic Model (2/6)
Definitions:
All index term weights are binary, i.e. $w_{i,j} \in \{0, 1\}$.
Let R be the set of documents known to be relevant to query q.
Let $\bar{R}$ be the complement of R.
Let $P(R \mid \vec{d}_j)$ be the probability that the document dj is relevant to the query q.
Let $P(\bar{R} \mid \vec{d}_j)$ be the probability that the document dj is non-relevant to the query q.
Probabilistic Model (3/6)
The similarity sim(dj, q) of the document dj to the query q is defined as the ratio
$$sim(d_j, q) = \frac{P(R \mid \vec{d}_j)}{P(\bar{R} \mid \vec{d}_j)}$$
Using Bayes' rule,
$$sim(d_j, q) = \frac{P(\vec{d}_j \mid R)\, P(R)}{P(\vec{d}_j \mid \bar{R})\, P(\bar{R})}$$
P(R) stands for the probability that a document randomly selected from the entire collection is relevant.
$P(\vec{d}_j \mid R)$ stands for the probability of randomly selecting the document dj from the set R of relevant documents.
Probabilistic Model (4/6)
Assuming independence of index terms,
$$sim(d_j, q) = \frac{\left(\prod_{k_i \in \vec{d}_j} P(k_i \mid R)\right)\left(\prod_{k_i \notin \vec{d}_j} P(\bar{k}_i \mid R)\right)}{\left(\prod_{k_i \in \vec{d}_j} P(k_i \mid \bar{R})\right)\left(\prod_{k_i \notin \vec{d}_j} P(\bar{k}_i \mid \bar{R})\right)} \cdot \frac{P(R)}{P(\bar{R})}$$
Taking the logarithm and dropping $\log \frac{P(R)}{P(\bar{R})}$, which is constant for all documents:
$$sim(d_j, q) \sim \log \frac{\prod_{k_i \in \vec{d}_j} P(k_i \mid R) \prod_{k_i \notin \vec{d}_j} P(\bar{k}_i \mid R)}{\prod_{k_i \in \vec{d}_j} P(k_i \mid \bar{R}) \prod_{k_i \notin \vec{d}_j} P(\bar{k}_i \mid \bar{R})}$$
Probabilistic Model (5/6)
$P(k_i \mid R)$ stands for the probability that the index term ki is present in a document randomly selected from the set R.
$P(\bar{k}_i \mid R)$ stands for the probability that the index term ki is not present in a document randomly selected from the set R.
Probabilistic Model (6/6)
Splitting the products by the value of $g_i(\vec{d}_j)$ and using $P(k_i \mid R) + P(\bar{k}_i \mid R) = 1$:
$$sim(d_j, q) \sim \log \frac{\prod_{g_i(\vec{d}_j)=1} P(k_i \mid R) \prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i \mid R)}{\prod_{g_i(\vec{d}_j)=1} P(k_i \mid \bar{R}) \prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i \mid \bar{R})}$$
which yields the key ranking formula of the probabilistic model:
$$sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q}\, w_{i,j} \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)$$
Estimation of Term Relevance
In the very beginning (before any documents are retrieved):
$$P(k_i \mid R) = 0.5, \qquad P(k_i \mid \bar{R}) = \frac{df_i}{N}$$
where $df_i$ is the number of documents containing ki and N is the collection size.

Next, let V be a subset of the documents initially retrieved and ranked, and $V_i$ the subset of V containing ki. The ranking can then be improved as follows:
$$P(k_i \mid R) = \frac{V_i}{V}, \qquad P(k_i \mid \bar{R}) = \frac{df_i - V_i}{N - V}$$

For small values of V, adjustment factors are added to avoid degenerate estimates:
$$P(k_i \mid R) = \frac{V_i + 0.5}{V + 1}, \qquad P(k_i \mid \bar{R}) = \frac{df_i - V_i + 0.5}{N - V + 1}$$
or, using $df_i/N$ in place of the constant 0.5:
$$P(k_i \mid R) = \frac{V_i + df_i/N}{V + 1}, \qquad P(k_i \mid \bar{R}) = \frac{df_i - V_i + df_i/N}{N - V + 1}$$
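A sketch of the BIR ranking formula with the initial estimates above; the collection statistics and binary vectors are toy values, not from the book:

```python
import math

N = 1000                       # collection size
df = [50, 300, 10]             # document frequency of each term ki
p_rel = [0.5] * 3              # initial P(ki | R)
p_non = [d / N for d in df]    # initial P(ki | R_bar) = df_i / N

def bir_score(doc_vec, query_vec):
    """Sum of log-odds weights over query terms present in the document."""
    s = 0.0
    for i, (dj, qi) in enumerate(zip(doc_vec, query_vec)):
        if dj and qi:
            s += math.log(p_rel[i] / (1 - p_rel[i]))
            s += math.log((1 - p_non[i]) / p_non[i])
    return s

print(bir_score((1, 0, 1), (1, 1, 1)))   # log(19) + log(99) ~ 7.54
```

After an initial retrieval, p_rel and p_non would be re-estimated from V and Vi as in the formulas above.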
Alternative Set Theoretic Models
Fuzzy Set Model
Extended Boolean Model
Fuzzy Theory
A fuzzy subset A of a universe U is characterized by a membership function $\mu_A: U \to [0, 1]$ which associates with each element u of U a number $\mu_A(u)$ in [0, 1].
Let A and B be two fuzzy subsets of U. Then:
$$\mu_{\bar{A}}(u) = 1 - \mu_A(u)$$
$$\mu_{A \cup B}(u) = \max(\mu_A(u), \mu_B(u))$$
$$\mu_{A \cap B}(u) = \min(\mu_A(u), \mu_B(u))$$
Fuzzy Information Retrieval
Using a term-term correlation matrix:
$$c_{i,l} = \frac{df_{i,l}}{df_i + df_l - df_{i,l}}$$
where $df_i$ ($df_l$) is the number of documents containing term ki (kl) and $df_{i,l}$ is the number of documents containing both.
Define a fuzzy set associated to each index term ki, with membership function
$$\mu_i(d_j) = 1 - \prod_{k_l \in d_j} (1 - c_{i,l})$$
If some term kl of dj is strongly related to ki, i.e. $c_{i,l} \approx 1$, then $\mu_i(d_j) \approx 1$.
If all terms of dj are loosely related to ki, i.e. $c_{i,l} \approx 0$, then $\mu_i(d_j) \approx 0$.
Example
Query $q = k_a \wedge (k_b \vee \neg k_c)$; in disjunctive normal form:
$$\vec{q}_{dnf} = (k_a \wedge k_b \wedge k_c) \vee (k_a \wedge k_b \wedge \bar{k}_c) \vee (k_a \wedge \bar{k}_b \wedge \bar{k}_c) = cc_1 \vee cc_2 \vee cc_3$$
The degree of membership of dj in the query fuzzy set is computed with algebraic sums and products:
$$\mu_q(d_j) = \mu_{cc_1 \vee cc_2 \vee cc_3}(d_j) = 1 - \prod_{i=1}^{3} \left(1 - \mu_{cc_i}(d_j)\right)$$
$$= 1 - (1 - \mu_{a,j}\,\mu_{b,j}\,\mu_{c,j}) \times (1 - \mu_{a,j}\,\mu_{b,j}\,(1 - \mu_{c,j})) \times (1 - \mu_{a,j}\,(1 - \mu_{b,j})\,(1 - \mu_{c,j}))$$
Algebraic Sum and Product
The degree of membership in a disjunctive fuzzy set is computed using an algebraic sum, instead of the max function.
The degree of membership in a conjunctive fuzzy set is computed using an algebraic product, instead of the min function.
These are smoother than the max and min functions.
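A sketch of the fuzzy membership computation in Python; all document-frequency counts below are invented toy values:

```python
# df[i]: number of docs containing ki; df_pair: docs containing both terms
df = {"a": 40, "b": 30, "c": 20}
df_pair = {("a", "b"): 15, ("a", "c"): 5, ("b", "c"): 10}

def corr(i, l):
    """c_{i,l} = df_{i,l} / (df_i + df_l - df_{i,l}); c_{i,i} = 1."""
    if i == l:
        return 1.0
    n = df_pair.get((i, l)) or df_pair.get((l, i))
    return n / (df[i] + df[l] - n)

def membership(i, doc_terms):
    """mu_i(dj) = 1 - prod over kl in dj of (1 - c_{i,l})."""
    p = 1.0
    for l in doc_terms:
        p *= 1.0 - corr(i, l)
    return 1.0 - p

print(membership("a", ["b", "c"]))   # dj indexed by kb and kc, ~0.34
```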
Alternative Algebraic Models
Generalized Vector Space Model
Latent Semantic Indexing Model
Latent Semantic Indexing (1/5)
Let A be a term-document association matrix with m rows and n columns.
Latent semantic indexing decomposes A using the singular value decomposition:
$$A = U \Sigma V^T$$
U (m×m) is the matrix of eigenvectors derived from the term-to-term correlation matrix ($AA^T$).
V (n×n) is the matrix of eigenvectors derived from the document-to-document matrix ($A^TA$).
$\Sigma$ is an m×n diagonal matrix of singular values, where $r \le \min(t, N)$ is the rank of A.
Latent Semantic Indexing (2/5)
Consider now only the s largest singular values of $\Sigma$, together with their corresponding columns in U and V (the remaining singular values of $\Sigma$ are deleted):
$$A_s = U_s \Sigma_s V_s^T$$
The resultant matrix $A_s$ (rank s) is closest to the original matrix A in the least-squares sense.
s < r is the dimensionality of a reduced concept space.
Latent Semantic Indexing (3/5)
The selection of s attempts to balance two opposing effects:
s should be large enough to allow fitting all the structure in the real data.
s should be small enough to allow filtering out the non-relevant representational details.
$U_s = \{u_1, u_2, \dots, u_s\}$ are the s principal components of the column space (document space) in $R^m$.
$V_s = \{v_1, v_2, \dots, v_s\}$ are the s principal components of the row space (term space) in $R^n$.
Latent Semantic Indexing (4/5)
Consider the relationship between any two documents:
$$A_s^T A_s = (U_s \Sigma_s V_s^T)^T (U_s \Sigma_s V_s^T) = V_s \Sigma_s U_s^T U_s \Sigma_s V_s^T = V_s \Sigma_s \Sigma_s V_s^T = (V_s \Sigma_s)(V_s \Sigma_s)^T$$
$U_s^T \vec{d}_i$ is the projected vector for document $d_i$ ($R^m \to R^s$).
$V_s^T \vec{t}_i$ is the projected vector for term vector $t_i$ ($R^n \to R^s$).
Latent Semantic Indexing (5/5)
To rank documents with regard to a given user query, we model the query as a pseudo-document in the original matrix A. Assume the query is modeled as the document with number k. Then the kth row in the matrix $A_s^T A_s$ provides the ranks of all documents with respect to this query.

Speedup: the matrix-vector multiplication $A^T \vec{q}$ requires a total of Nt scalar multiplications, while
$$A_s^T \vec{q} = (V_s \Sigma_s)(U_s^T \vec{q})$$
requires only (n + m)s scalar multiplications.
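A sketch of these steps with NumPy: SVD of a toy term-document matrix, rank-s truncation, and query scoring in the cheaper factored form; all matrix values are invented for illustration:

```python
import numpy as np

A = np.array([[2., 0., 1.],     # term-document matrix (m terms x n docs)
              [0., 3., 1.],
              [1., 1., 0.],
              [0., 2., 2.]])
U, sing, Vt = np.linalg.svd(A, full_matrices=False)

s = 2                                   # reduced concept-space dimension
Us, Ss, Vts = U[:, :s], np.diag(sing[:s]), Vt[:s, :]
As = Us @ Ss @ Vts                      # rank-s approximation of A

q = np.array([1., 0., 1., 0.])          # query as a term vector
q_s = Us.T @ q                          # project query: R^m -> R^s
scores = (Vts.T @ Ss) @ q_s             # As^T q via (n+m)s multiplications
print(scores)                           # one score per document
```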
Alternative Probabilistic Model
Bayesian Networks
Inference Network Model
Belief Network Model
Bayesian Network
Let $x_i$ be a node in a Bayesian network G and $\Gamma_{x_i}$ be the set of parent nodes of $x_i$.
The influence of $\Gamma_{x_i}$ on $x_i$ can be specified by any set of functions $F_i(x_i, \Gamma_{x_i})$ that satisfy:
$$0 \le F_i(x_i, \Gamma_{x_i}) \le 1, \qquad \sum_{x_i} F_i(x_i, \Gamma_{x_i}) = 1$$
Example:
$$P(x_1, x_2, x_3, x_4, x_5) = P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_1)\,P(x_4 \mid x_2, x_3)\,P(x_5 \mid x_3)$$
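A sketch of this factorization in Python for the five-node example; every probability table entry below is a made-up number:

```python
p_x1 = 0.6                              # P(x1 = 1)
p_x2 = {1: 0.7, 0: 0.2}                 # P(x2 = 1 | x1)
p_x3 = {1: 0.5, 0: 0.1}                 # P(x3 = 1 | x1)
p_x4 = {(1, 1): 0.9, (1, 0): 0.5,
        (0, 1): 0.4, (0, 0): 0.05}      # P(x4 = 1 | x2, x3)
p_x5 = {1: 0.8, 0: 0.3}                 # P(x5 = 1 | x3)

def bern(p, value):
    """P(X = value) for a binary variable with P(X = 1) = p."""
    return p if value == 1 else 1.0 - p

def joint(x1, x2, x3, x4, x5):
    """P(x1..x5) = P(x1) P(x2|x1) P(x3|x1) P(x4|x2,x3) P(x5|x3)."""
    return (bern(p_x1, x1) * bern(p_x2[x1], x2) * bern(p_x3[x1], x3)
            * bern(p_x4[(x2, x3)], x4) * bern(p_x5[x3], x5))

print(joint(1, 1, 0, 0, 1))   # 0.6 * 0.7 * 0.5 * 0.5 * 0.3 = 0.0315
```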
Belief Network Model (1/6)
The probability space: the set K = {k1, k2, …, kt} of all index terms is the universe. To each subset u of K is associated a vector $\vec{k}_u$ such that $g_i(\vec{k}_u) = 1 \Leftrightarrow k_i \in u$.
Random variables: to each index term ki is associated a binary random variable.
Belief Network Model (2/6)
Concept space: a document dj is represented as a concept composed of the terms used to index dj. A user query q is also represented as a concept composed of the terms used to index q. Both user query and document are modeled as subsets of index terms.
Probability distribution P over K:
$$P(c) = \sum_{u} P(c \mid u)\, P(u), \qquad P(u) = \left(\frac{1}{2}\right)^t$$
Belief Network Model (3/6)
A query q is modeled as a network node, a binary random variable that is set to 1 whenever q completely covers the concept space K; P(q) computes the degree of coverage of the space K by q.
A document dj is likewise modeled as a network node; its random variable is 1 to indicate that dj completely covers the concept space K, and P(dj) computes the degree of coverage of the space K by dj.
Belief Network Model (4/6)
Belief Network Model (5/6)
Assumption: P(dj | q) is adopted as the rank of the document dj with respect to the query q.
$$P(d_j \mid q) = \frac{P(d_j \wedge q)}{P(q)} \sim P(d_j \wedge q)$$
since P(q) is a constant for all documents. Then
$$P(d_j \wedge q) = \sum_{u} P(d_j \wedge q \mid u)\, P(u) = \sum_{u} P(d_j \mid u)\, P(q \mid u)\, P(u) = \sum_{\vec{k}} P(d_j \mid \vec{k})\, P(q \mid \vec{k})\, P(\vec{k})$$
Belief Network Model (6/6)
Specify the conditional probabilities as follows:
$$P(d_j \mid \vec{k}) = \begin{cases} \dfrac{w_{i,j}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2}} & \text{if } \vec{k} = \vec{k}_i \wedge g_i(\vec{d}_j) = 1 \\ 0 & \text{otherwise} \end{cases}$$
$$P(q \mid \vec{k}) = \begin{cases} \dfrac{w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,q}^2}} & \text{if } \vec{k} = \vec{k}_i \wedge g_i(\vec{q}) = 1 \\ 0 & \text{otherwise} \end{cases}$$
Thus, the belief network model can be tuned to subsume the vector model.
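A sketch showing how these conditionals reduce the ranking to the cosine measure; the weights are toy values, and the constant factor $P(\vec{k})$ is dropped:

```python
import numpy as np

w_dj = np.array([0.5, 0.8, 0.0])   # w_{i,j}
w_q  = np.array([0.4, 0.6, 0.1])   # w_{i,q}

# Only single-term states k = k_i contribute; terms with g_i = 0 give 0.
p_d_given_k = w_dj / np.linalg.norm(w_dj)   # P(dj | ki)
p_q_given_k = w_q / np.linalg.norm(w_q)     # P(q | ki)

rank = float(p_d_given_k @ p_q_given_k)     # sum_i P(dj|ki) P(q|ki)
print(rank)                                 # equals cosine sim(dj, q), ~0.99
```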
Comparison
Belief network model:
is based on a set-theoretic view
provides a separation between the document and the query
is able to reproduce any ranking strategy generated by the inference network model
Inference network model: