Chapter 2 Modeling Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto

Chapter 2 Modeling

Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto

Introduction

Traditional information retrieval systems usually adopt index terms to index and retrieve documents.

An index term is a keyword (or group of related words) which has some meaning of its own (usually a noun).

Advantages:

  Simple.

  The semantics of the documents and of the user information need can be naturally expressed through sets of index terms.

IR Models

Ranking algorithms are at the core of information retrieval systems (predicting which documents are relevant and which are not).

A taxonomy of information retrieval models

User task:

  Retrieval (ad hoc, filtering)

    Classic models: Boolean, Vector, Probabilistic

      Set theoretic: Fuzzy, Extended Boolean

      Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks

      Probabilistic: Inference Network, Belief Network

    Structured models: Non-overlapping Lists, Proximal Nodes

  Browsing: Flat, Structure Guided, Hypertext

User task: Retrieval

  Index Terms: Classic, Set Theoretic, Algebraic, Probabilistic

  Full Text: Classic, Set Theoretic, Algebraic, Probabilistic

  Full Text + Structure: Structured

User task: Browsing

  Index Terms: Flat

  Full Text: Flat, Hypertext

  Full Text + Structure: Structure Guided, Hypertext

Figure 2.2 Retrieval models most frequently associated with distinct combinations of a document logical view and a user task.

Retrieval: Ad hoc and Filtering

Ad hoc (Search): The documents in the collection remain relatively static while new queries are submitted to the system.

Routing (Filtering): The queries remain relatively static while new documents come into the system.

A formal characterization of IR models

$D$: A set composed of logical views (or representations) of the documents in the collection.

$Q$: A set composed of logical views (or representations) of the user information needs (queries).

$F$: A framework for modeling document representations, queries, and their relationships.

$R(q_i, d_j)$: A ranking function which defines an ordering among the documents with regard to the query $q_i$.

Define

$k_i$: A generic index term.

$K$: The set of all index terms $\{k_1, \ldots, k_t\}$.

$w_{i,j}$: A weight associated with index term $k_i$ of a document $d_j$.

$g_i$: A function that returns the weight associated with $k_i$ in any $t$-dimensional vector, i.e. $g_i(\vec{d}_j) = w_{i,j}$.

Classic IR Model

Basic concepts: Each document is described by a set of representative keywords called index terms.

Numerical weights are assigned to the index terms to capture how relevant each term is for describing the document contents.

Boolean model

Binary decision criterion

Data retrieval model

Advantage:

  Clean formalism, simplicity.

Disadvantages:

  It is not simple to translate an information need into a Boolean expression.

  Exact matching may lead to retrieval of too few or too many documents.

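Example:

The query $q = k_a \wedge (k_b \vee \neg k_c)$ can be represented as a disjunction of conjunctive vectors (in DNF):

$$\vec{q}_{dnf} = (1,1,1) \vee (1,1,0) \vee (1,0,0)$$

Formal definition

For the Boolean model, the index term weights are all binary. A query is a conventional Boolean expression, which can be transformed to a disjunctive normal form $\vec{q}_{dnf}$ with conjunctive components $\vec{q}_{cc}$:

$$sim(d_j, q) = \begin{cases} 1 & \text{if } \exists\, \vec{q}_{cc} \mid (\vec{q}_{cc} \in \vec{q}_{dnf}) \wedge (\forall k_i,\ w_{i,j} = g_i(\vec{q}_{cc})) \\ 0 & \text{otherwise} \end{cases}$$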
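The Boolean matching rule can be sketched in a few lines of Python. This is a minimal illustration, assuming a collection with only the three index terms of the example; the helper names `dnf_vectors` and `boolean_sim` are hypothetical and not part of the slides.

```python
from itertools import product

def dnf_vectors(query, num_terms):
    """Enumerate the binary term vectors (conjunctive components) that satisfy `query`."""
    return {bits for bits in product((0, 1), repeat=num_terms) if query(*bits)}

def boolean_sim(doc_vector, q_dnf):
    """Return 1 if the document's binary vector equals some conjunctive component, else 0."""
    return 1 if tuple(doc_vector) in q_dnf else 0

# Example query over the term order (k_a, k_b, k_c): k_a AND (k_b OR NOT k_c)
q_dnf = dnf_vectors(lambda a, b, c: a and (b or not c), 3)

# A document indexed by k_a and k_b only matches; one indexed by k_a and k_c only does not.
matches = boolean_sim((1, 1, 0), q_dnf)
misses = boolean_sim((1, 0, 1), q_dnf)
```

Enumerating all $2^t$ vectors is only practical for tiny vocabularies; it is used here because it mirrors the DNF definition directly.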

Vector model

Assign non-binary weights to index terms in queries and in documents. => TFxIDF

Compute the similarity between documents and query. => Sim(Dj, Q)

More precise than Boolean model.

The IR problem: a clustering problem

We think of the documents as a collection C of objects and think of the user query as a specification of a set A of objects.

Intra-cluster: What are the features which better describe the objects in the set A?

Inter-cluster: What are the features which better distinguish the objects in the set A from the remaining objects in the collection C?

Idea for TFxIDF

TF: intra-cluster similarity is quantified by measuring the raw frequency of a term $k_i$ inside a document $d_j$. This term frequency is usually referred to as the tf factor and provides one measure of how well that term describes the document contents.

IDF: inter-cluster dissimilarity is quantified by measuring the inverse of the frequency of a term $k_i$ among the documents in the collection. This factor is usually referred to as the inverse document frequency.

Vector Model (1/4)

Index terms are assigned positive and non-binary weights.

The index terms in the query are also weighted.

Term weights are used to compute the degree of similarity between documents and the user query. Then, retrieved documents are sorted in decreasing order.

$$\vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$$

$$\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$$

Vector Model (2/4)

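Degree of similarity

$$sim(d_j, q) = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j| \times |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$$

Figure 2.4 The cosine of $\theta$, the angle between $\vec{d}_j$ and $\vec{q}$, is adopted as $sim(d_j, q)$.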
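The cosine ranking can be sketched as follows. This is a minimal illustration, assuming documents and the query are already given as equal-length weight lists; the function names `cosine_sim` and `rank` and the choice to return 0.0 for an all-zero vector are assumptions made here, not something the slides specify.

```python
import math

def cosine_sim(d, q):
    """Cosine of the angle between weight vectors d and q."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0  # assumed convention for an empty document or query
    return dot / (norm_d * norm_q)

def rank(docs, q):
    """Return document ids sorted by decreasing similarity to the query."""
    return sorted(docs, key=lambda name: cosine_sim(docs[name], q), reverse=True)

# Hypothetical two-term collection
docs = {"d1": (1, 0), "d2": (1, 1), "d3": (0, 1)}
ordering = rank(docs, (1, 0))
```

Because the query norm is the same for every document, it does not affect the ordering; it is kept here so the value matches the formula.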

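Vector Model (3/4)

Definition

Let $N$ be the total number of documents, $n_i$ the number of documents in which $k_i$ appears, and $freq_{i,j}$ the raw frequency of $k_i$ in $d_j$.

Normalized frequency:

$$f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}}$$

Inverse document frequency:

$$idf_i = \log \frac{N}{n_i}$$

Term-weighting scheme:

$$w_{i,j} = f_{i,j} \times idf_i$$

Query-term weights:

$$w_{i,q} = \left(0.5 + \frac{0.5\, freq_{i,q}}{\max_l freq_{l,q}}\right) \times \log \frac{N}{n_i}$$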
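The weighting formulas can be sketched as below. This is a minimal illustration, assuming each document and query is a non-empty list of tokens and using the natural logarithm (the slides do not fix the base); `tf_idf_weights` and `query_weights` are hypothetical names.

```python
import math
from collections import Counter

def document_frequencies(docs):
    """n_i: number of documents in which each term appears."""
    return Counter(term for doc in docs for term in set(doc))

def tf_idf_weights(docs):
    """Return one {term: w_ij} dict per document, with w_ij = f_ij * idf_i."""
    N = len(docs)
    n = document_frequencies(docs)
    weights = []
    for doc in docs:
        freq = Counter(doc)
        max_freq = max(freq.values())  # assumes the document is non-empty
        weights.append({t: (f / max_freq) * math.log(N / n[t]) for t, f in freq.items()})
    return weights

def query_weights(query, docs):
    """w_iq = (0.5 + 0.5 * freq_iq / max_l freq_lq) * log(N / n_i)."""
    N = len(docs)
    n = document_frequencies(docs)
    freq = Counter(query)
    max_freq = max(freq.values())  # assumes the query is non-empty
    # Terms absent from the collection are skipped, since n_i = 0 leaves idf undefined.
    return {t: (0.5 + 0.5 * f / max_freq) * math.log(N / n[t])
            for t, f in freq.items() if n[t] > 0}

# Hypothetical three-document collection
example_docs = [["a", "a", "b"], ["b", "c"], ["a", "c", "c"]]
example_weights = tf_idf_weights(example_docs)
```

A term that occurs in every document gets $idf_i = \log 1 = 0$ and therefore contributes nothing to the ranking.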

Vector Model (4/4)

Advantages:

  Its term-weighting scheme improves retrieval performance.

  Its partial matching strategy allows retrieval of documents that approximate the query conditions.

  Its cosine ranking formula sorts the documents according to their degree of similarity to the query.

Disadvantage:

  The assumption of mutual independence between index terms.

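Example: two coordinate assignments for the vectors v1, v2, v3 and the resulting cosines.

v1 = (1,0), v2 = (1,1), v3 = (0,1):

  cos(v1,v2) = 1/√2, cos(v2,v3) = 1/√2, cos(v1,v3) = 0

v1 = (1,0), v2 = (0,1), v3 = (-1,1):

  cos(v1,v2) = 0, cos(v2,v3) = 1/√2, cos(v1,v3) = -1/√2

(Figure: the vectors v1, v2 and v3, with the label "Orthogonal".)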

Probabilistic Model (1/6)

Introduced by Robertson and Sparck Jones, 1976.

Also called the binary independence retrieval (BIR) model.

Idea: Given a user query $q$ and the ideal answer set of the relevant documents, the problem is to specify the properties of this set.

That is, the probabilistic model tries to estimate the probability that the user will find the document $d_j$ relevant, and ranks by the ratio $P(d_j \text{ relevant to } q) / P(d_j \text{ non-relevant to } q)$.

Probabilistic Model (2/6)

Definition

All index term weights are binary, i.e., $w_{i,j} \in \{0,1\}$.

Let $R$ be the set of documents known to be relevant to query $q$.

Let $\bar{R}$ be the complement of $R$.

Let $P(R \mid d_j)$ be the probability that the document $d_j$ is relevant to the query $q$.

Let $P(\bar{R} \mid d_j)$ be the probability that the document $d_j$ is non-relevant to the query $q$.

Probabilistic Model (3/6)

The similarity $sim(d_j, q)$ of the document $d_j$ to the query $q$ is defined as the ratio

$$sim(d_j, q) = \frac{P(R \mid d_j)}{P(\bar{R} \mid d_j)}$$

Using Bayes' rule,

$$sim(d_j, q) = \frac{P(d_j \mid R) \times P(R)}{P(d_j \mid \bar{R}) \times P(\bar{R})}$$

$P(R)$ stands for the probability that a document randomly selected from the entire collection is relevant.

$P(d_j \mid R)$ stands for the probability of randomly selecting the document $d_j$ from the set $R$ of relevant documents.

Probabilistic Model (4/6)

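Assuming independence of index terms and given $\vec{d}_j = (d_1, d_2, \ldots, d_t)$, where $d_i$ is the binary value of index term $k_i$ in $d_j$,

$$P(d_j \mid R) = \prod_{i=1}^{t} P(k_i = d_i \mid R)$$

$$P(d_j \mid \bar{R}) = \prod_{i=1}^{t} P(k_i = d_i \mid \bar{R})$$

Taking logarithms,

$$sim(d_j, q) \sim \log \frac{P(d_j \mid R)}{P(d_j \mid \bar{R})} + \log \frac{P(R)}{P(\bar{R})}$$

Since $P(R)$ and $P(\bar{R})$ are the same for all documents,

$$sim(d_j, q) \sim \log \frac{\prod_{i=1}^{t} P(k_i = d_i \mid R)}{\prod_{i=1}^{t} P(k_i = d_i \mid \bar{R})}$$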

Probabilistic Model (5/6)

$P(k_i \mid R)$ stands for the probability that the index term $k_i$ is present in a document randomly selected from the set $R$.

$P(\bar{k}_i \mid R)$ stands for the probability that the index term $k_i$ is not present in a document randomly selected from the set $R$.

Probabilistic Model (6/6)

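$$sim(d_j, q) \sim \frac{\prod_{g_i(\vec{d}_j)=1} P(k_i \mid R) \times \prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i \mid R)}{\prod_{g_i(\vec{d}_j)=1} P(k_i \mid \bar{R}) \times \prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i \mid \bar{R})}$$

$$P(k_i \mid R) + P(\bar{k}_i \mid R) = 1$$

$$sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \times w_{i,j} \times \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)$$

$$sim(d_j, q) \sim \sum_{i=1}^{t} \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)$$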
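The weighted ranking sum can be sketched as follows. This is a minimal illustration, assuming documents and queries are sets of terms (so the binary weights $w_{i,q} \times w_{i,j}$ equal 1 exactly for shared terms) and that every probability lies strictly between 0 and 1; the name `bir_sim` and the probability values are hypothetical.

```python
import math

def bir_sim(doc_terms, query_terms, p_rel, p_nonrel):
    """Sum, over terms in both d_j and q, of
    log(P(k_i|R) / (1 - P(k_i|R))) + log((1 - P(k_i|~R)) / P(k_i|~R))."""
    score = 0.0
    for k in query_terms & doc_terms:  # binary weights select the shared terms
        p, u = p_rel[k], p_nonrel[k]
        score += math.log(p / (1.0 - p)) + math.log((1.0 - u) / u)
    return score

# Hypothetical estimates: term "a" is rare among non-relevant documents, term "b" is not
p_rel = {"a": 0.5, "b": 0.5}
p_nonrel = {"a": 0.1, "b": 0.5}

score_a = bir_sim({"a"}, {"a", "b"}, p_rel, p_nonrel)
score_b = bir_sim({"b"}, {"a", "b"}, p_rel, p_nonrel)
```

With $P(k_i \mid R) = 0.5$ the first logarithm vanishes, so the score of a document depends only on how rare its query terms are among non-relevant documents.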

Estimation of Term Relevance

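In the very beginning, with $df_i$ the number of documents containing $k_i$ and $N$ the total number of documents:

$$P(k_i \mid R) = 0.5 \qquad P(k_i \mid \bar{R}) = \frac{df_i}{N}$$

Let $V$ be a subset of the documents initially retrieved, and let $V_i$ be the subset of $V$ whose documents contain $k_i$ ($V$ and $V_i$ also denote the sizes of these sets).

Next, the ranking can be improved as follows:

$$P(k_i \mid R) = \frac{V_i}{V} \qquad P(k_i \mid \bar{R}) = \frac{df_i - V_i}{N - V}$$

For small values of $V$:

$$P(k_i \mid R) = \frac{V_i + 0.5}{V + 1} \qquad P(k_i \mid \bar{R}) = \frac{df_i - V_i + 0.5}{N - V + 1}$$

or, alternatively,

$$P(k_i \mid R) = \frac{V_i + \frac{V_i}{V}}{V + 1} \qquad P(k_i \mid \bar{R}) = \frac{df_i - V_i + \frac{V_i}{V}}{N - V + 1}$$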
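The initial and the improved estimates can be sketched as two small functions. This is a minimal illustration covering the initial guess, the plain ratio, and the 0.5-adjusted form; the function names and the `smooth` flag are hypothetical, and the arguments are the counts $V_i$, $V$, $df_i$ and $N$.

```python
def initial_estimates(df_i, N):
    """Return (P(k_i|R), P(k_i|~R)) before any documents have been retrieved."""
    return 0.5, df_i / N

def improved_estimates(V_i, V, df_i, N, smooth=True):
    """Return (P(k_i|R), P(k_i|~R)) from the top-ranked subset V.

    With smooth=True the 0.5 / 1 adjustment is applied, which avoids
    probabilities of exactly 0 or 1 when V is small.
    """
    if smooth:
        return (V_i + 0.5) / (V + 1), (df_i - V_i + 0.5) / (N - V + 1)
    return V_i / V, (df_i - V_i) / (N - V)

# Hypothetical counts: 100 documents, term in 10 of them, 3 of the top 10 contain it
p0, u0 = initial_estimates(10, 100)
p1, u1 = improved_estimates(3, 10, 10, 100)
```

The adjusted form matters in practice because an estimate of exactly 0 or 1 makes the logarithms in the ranking formula undefined.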

Alternative Set Theoretic Models

Fuzzy Set Model

Extended Boolean Model

Fuzzy Theory

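A fuzzy subset $A$ of a universe $U$ is characterized by a membership function $\mu_A: U \rightarrow [0,1]$ which associates with each element $u \in U$ a number $\mu_A(u)$ in the interval $[0,1]$.

Let $A$ and $B$ be two fuzzy subsets of $U$, and let $\bar{A}$ be the complement of $A$. Then

$$\mu_{\bar{A}}(u) = 1 - \mu_A(u)$$

$$\mu_{A \cup B}(u) = \max(\mu_A(u), \mu_B(u))$$

$$\mu_{A \cap B}(u) = \min(\mu_A(u), \mu_B(u))$$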

Fuzzy Information Retrieval

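Using a term-term correlation matrix, where $df_u$ is the number of documents containing term $k_u$ and $df_{u,v}$ the number of documents containing both $k_u$ and $k_v$:

$$c_{u,v} = \frac{df_{u,v}}{df_u + df_v - df_{u,v}}$$

Define a fuzzy set associated to each index term $k_i$, in which a document $d_j$ has the degree of membership

$$\mu_i(d_j) = 1 - \prod_{k_l \in d_j} (1 - c_{i,l})$$

If some term $k_l$ of $d_j$ is strongly related to $k_i$, that is $c_{i,l} \sim 1$, then $\mu_i(d_j) \sim 1$.

If every term $k_l$ of $d_j$ is only loosely related to $k_i$, that is $c_{i,l} \sim 0$, then $\mu_i(d_j) \sim 0$.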
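The correlation factor and the membership degree can be sketched as follows. This is a minimal illustration, assuming each document is a set of terms; the names `correlation` and `membership` and the toy collection are hypothetical.

```python
def correlation(docs, u, v):
    """c_{u,v} = df_{u,v} / (df_u + df_v - df_{u,v}), with docs a list of term sets."""
    df_u = sum(1 for d in docs if u in d)
    df_v = sum(1 for d in docs if v in d)
    df_uv = sum(1 for d in docs if u in d and v in d)
    denom = df_u + df_v - df_uv
    return df_uv / denom if denom else 0.0  # 0.0 if neither term occurs (assumed convention)

def membership(docs, term, doc):
    """mu_i(d_j) = 1 - product over k_l in d_j of (1 - c_{i,l})."""
    remaining = 1.0
    for other in doc:
        remaining *= 1.0 - correlation(docs, term, other)
    return 1.0 - remaining

# Hypothetical collection: "car" and "auto" always co-occur
collection = [{"car", "auto"}, {"car", "auto", "road"}, {"road", "map"}]

mu_strong = membership(collection, "car", {"auto"})
mu_weak = membership(collection, "car", {"road", "map"})
```

A document that never mentions "car" can still belong strongly to the fuzzy set of "car" when it contains a highly correlated term, which is the point of the model.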

Example

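Disjunctive Normal Form

$$q = k_a \wedge (k_b \vee \neg k_c)$$

$$\vec{q}_{dnf} = (k_a \wedge k_b \wedge k_c) \vee (k_a \wedge k_b \wedge \neg k_c) \vee (k_a \wedge \neg k_b \wedge \neg k_c)$$

Degree of membership of $d_j$ in each conjunctive component:

$$\mu_{a,b,c}(d_j) = \mu_a(d_j)\,\mu_b(d_j)\,\mu_c(d_j) = \mu_{a,j}\,\mu_{b,j}\,\mu_{c,j}$$

$$\mu_{a,b,\bar{c}}(d_j) = \mu_{a,j}\,\mu_{b,j}\,(1 - \mu_{c,j})$$

$$\mu_{a,\bar{b},\bar{c}}(d_j) = \mu_{a,j}\,(1 - \mu_{b,j})\,(1 - \mu_{c,j})$$

Degree of membership of $d_j$ in the query:

$$\mu_q(d_j) = \mu_{cc_1 + cc_2 + cc_3}(d_j) = 1 - \prod_{i=1}^{3} \left(1 - \mu_{cc_i}(d_j)\right)$$

$$\mu_q(d_j) = 1 - \left(1 - \mu_{a,b,c}(d_j)\right)\left(1 - \mu_{a,b,\bar{c}}(d_j)\right)\left(1 - \mu_{a,\bar{b},\bar{c}}(d_j)\right)$$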
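The query membership of this example can be sketched with an algebraic product per conjunctive component and an algebraic sum across components. This is a minimal illustration; the names `conj_membership`, `query_membership` and `Q_DNF` are hypothetical, and the input is the triple $(\mu_{a,j}, \mu_{b,j}, \mu_{c,j})$.

```python
def conj_membership(mu, signs):
    """Algebraic product over one conjunctive component.

    signs[i] is False when the i-th term is negated, in which case
    (1 - mu[i]) is used instead of mu[i].
    """
    result = 1.0
    for m, positive in zip(mu, signs):
        result *= m if positive else 1.0 - m
    return result

def query_membership(mu, components):
    """Algebraic sum over the conjunctive components: 1 - prod(1 - mu_cc)."""
    remaining = 1.0
    for signs in components:
        remaining *= 1.0 - conj_membership(mu, signs)
    return 1.0 - remaining

# (k_a, k_b, k_c), (k_a, k_b, not k_c), (k_a, not k_b, not k_c)
Q_DNF = [(True, True, True), (True, True, False), (True, False, False)]

mu_q = query_membership((0.5, 0.5, 0.5), Q_DNF)
```

With crisp memberships (all 0 or 1) the result reduces to the Boolean answer, so the fuzzy model extends the Boolean one rather than replacing it.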

Algebraic Sum and Product

The degree of membership in a disjunctive fuzzy set is computed using an algebraic sum, instead of max function.

The degree of membership in a conjunctive fuzzy set is computed using an algebraic product, instead of min function.

Smoother than the max and min functions.

Alternative Algebraic Models

Generalized Vector Space Model

Latent Semantic Model

Latent Semantic Indexing (1/5)

Let $A$ be a term-document association matrix with $m$ rows and $n$ columns.

Latent semantic indexing decomposes $A$ using the singular value decomposition:

$$A = U \Sigma V^T$$

$U$ ($m \times m$) is the matrix of eigenvectors derived from the term-to-term correlation matrix ($AA^T$).

$V$ ($n \times n$) is the matrix of eigenvectors derived from the document-to-document matrix ($A^TA$).

$\Sigma$ is an $m \times n$ diagonal matrix of singular values, where $r \le \min(t, N)$ is the rank of $A$ (here $m = t$ index terms and $n = N$ documents).

Latent Semantic Indexing (2/5)

Consider now only the $s$ largest singular values of $\Sigma$, and their corresponding columns in $U$ and $V$. (The remaining singular values of $\Sigma$ are deleted.)

$$A_s = U_s \Sigma_s V_s^T = U D V^T, \qquad D = \begin{pmatrix} \Sigma_s & 0 \\ 0 & 0 \end{pmatrix}$$

The resultant matrix $A_s$ (rank $s$) is closest to the original matrix $A$ in the least squares sense.

$s < r$ is the dimensionality of a reduced concept space.
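The rank-$s$ approximation can be sketched without a linear algebra library. This is a minimal illustration that swaps in power iteration with deflation for a full SVD routine, which is not how the slides or a production system would compute it; it assumes the all-ones starting vector is not orthogonal to the leading singular vector and that 200 iterations are enough for convergence. All function names are hypothetical.

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def top_singular_triple(A, iterations=200):
    """Power iteration on A^T A; returns (sigma, u, v) for the largest singular value."""
    At = transpose(A)
    v = [1.0] * len(A[0])  # assumed not orthogonal to the leading singular vector
    for _ in range(iterations):
        w = matvec(At, matvec(A, v))
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # A is numerically zero
            return 0.0, [0.0] * len(A), v
        v = [x / norm for x in w]
    Av = matvec(A, v)
    sigma = math.sqrt(sum(x * x for x in Av))
    u = [x / sigma for x in Av] if sigma else [0.0] * len(A)
    return sigma, u, v

def rank_s_approximation(A, s):
    """Return (A_s, singular_values), built from s rounds of power iteration with deflation."""
    m, n = len(A), len(A[0])
    residual = [[float(x) for x in row] for row in A]
    A_s = [[0.0] * n for _ in range(m)]
    sigmas = []
    for _ in range(s):
        sigma, u, v = top_singular_triple(residual)
        sigmas.append(sigma)
        for i in range(m):
            for j in range(n):
                A_s[i][j] += sigma * u[i] * v[j]       # add the rank-1 term sigma * u * v^T
                residual[i][j] -= sigma * u[i] * v[j]  # deflate it from the residual
    return A_s, sigmas

# Hypothetical 2 x 2 term-document matrix
A1, singular_values = rank_s_approximation([[3, 0], [0, 1]], 1)
```

Each round extracts one singular triple $(\sigma, u, v)$, so stopping after $s$ rounds yields $A_s = \sum \sigma\, u\, v^T$ over the $s$ largest singular values.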

Latent Semantic Indexing (3/5)

The selection of $s$ attempts to balance two opposing effects:

  $s$ should be large enough to allow fitting all the structure in the real data.

  $s$ should be small enough to allow filtering out all the non-relevant representational details.

$U_s = \{u_1, u_2, \ldots, u_s\}$ are the $s$ principal components of the column space (document space) in $R^m$.

$V_s = \{v_1, v_2, \ldots, v_s\}$ are the $s$ principal components of the row space (term space) in $R^n$.

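Latent Semantic Indexing (4/5)

Consider the relationship between any two documents:

$$A_s^T A_s = (U_s \Sigma_s V_s^T)^T (U_s \Sigma_s V_s^T) = V_s \Sigma_s U_s^T U_s \Sigma_s V_s^T = V_s \Sigma_s \Sigma_s V_s^T = (V_s \Sigma_s)(V_s \Sigma_s)^T$$

$U_s^T d_i$ is the projected vector for document $d_i$ ($R^m \rightarrow R^s$).

$V_s^T t_i$ is the projected vector for term vector $t_i$ ($R^n \rightarrow R^s$).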

Latent Semantic Indexing (5/5)

To rank documents with regard to a given user query, we model the query as a pseudo-document in the original matrix $A$.

Assume the query is modeled as the document with number $k$.

Then the $k$th row in the matrix $A_s^T A_s$ provides the ranks of all documents with respect to this query.

Speedup

The matrix-vector multiplication $A^T q$ requires a total of $N t$ scalar multiplications,

while $A_s^T q = (V_s \Sigma_s)(U_s^T q)$ requires only $(n + m)s$ scalar multiplications.

Alternative Probabilistic Model

Bayesian Networks

Inference Network Model

Belief Network Model

Bayesian Network

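Let $x_i$ be a node in a Bayesian network $G$ and $\Gamma_{x_i}$ be the set of parent nodes of $x_i$.

The influence of $\Gamma_{x_i}$ on $x_i$ can be specified by any set of functions $F_i(x_i, \Gamma_{x_i})$ that satisfy:

$$\sum_{\forall x_i} F_i(x_i, \Gamma_{x_i}) = 1$$

$$0 \le F_i(x_i, \Gamma_{x_i}) \le 1$$

Example:

$$P(x_1, x_2, x_3, x_4, x_5) = P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_1)\,P(x_4 \mid x_2, x_3)\,P(x_5 \mid x_3)$$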
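The five-node factorization can be sketched with explicit conditional probability tables. This is a minimal illustration, assuming binary nodes; every number in the tables is hypothetical and chosen only so that each table satisfies the two constraints on $F_i$.

```python
from itertools import product

# Hypothetical tables giving P(node = 1 | parents); P(node = 0 | parents) is the complement.
P_X1 = 0.6
P_X2 = {0: 0.2, 1: 0.7}                                      # keyed by x1
P_X3 = {0: 0.5, 1: 0.1}                                      # keyed by x1
P_X4 = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.6, (1, 1): 0.9}  # keyed by (x2, x3)
P_X5 = {0: 0.3, 1: 0.8}                                      # keyed by x3

def bernoulli(p_one, value):
    """Probability of `value` (0 or 1) for a binary variable with P(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one

def joint(x1, x2, x3, x4, x5):
    """P(x1) P(x2|x1) P(x3|x1) P(x4|x2,x3) P(x5|x3)."""
    return (bernoulli(P_X1, x1)
            * bernoulli(P_X2[x1], x2)
            * bernoulli(P_X3[x1], x3)
            * bernoulli(P_X4[(x2, x3)], x4)
            * bernoulli(P_X5[x3], x5))

# Summing the joint over all 32 assignments should give 1 (up to rounding).
total = sum(joint(*xs) for xs in product((0, 1), repeat=5))
```

The factorization needs 1 + 2 + 2 + 4 + 2 = 11 numbers instead of the 31 required by an unrestricted joint distribution over five binary variables.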

Belief Network Model (1/6)

The probability space: The set $K = \{k_1, k_2, \ldots, k_t\}$ is the universe. To each subset $u$ is associated a vector $\vec{k}$ such that $g_i(\vec{k}) = 1 \iff k_i \in u$.

Random variables: To each index term $k_i$ is associated a binary random variable.

Belief Network Model (2/6)

Concept space:

  A document $d_j$ is represented as a concept composed of the terms used to index $d_j$.

  A user query $q$ is also represented as a concept composed of the terms used to index $q$.

  Both user query and document are modeled as subsets of index terms.

Probability distribution $P$ over $K$:

$$P(c) = \sum_{u} P(c \mid u) \times P(u)$$

$$P(u) = \left(\frac{1}{2}\right)^t$$

Belief Network Model (3/6)

A query is modeled as a network node.

  This variable is set to 1 whenever $q$ completely covers the concept space $K$.

  $P(q)$ computes the degree of coverage of the space $K$ by $q$.

A document $d_j$ is modeled as a network node.

  This random variable is 1 to indicate that $d_j$ completely covers the concept space $K$.

  $P(d_j)$ computes the degree of coverage of the space $K$ by $d_j$.

Belief Network Model (4/6)

Belief Network Model (5/6)

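Assumption: $P(d_j \mid q)$ is adopted as the rank of the document $d_j$ with respect to the query $q$.

$$P(d_j \mid q) = P(d_j \wedge q) / P(q)$$

$$P(d_j \wedge q) = \sum_{u} P(d_j \wedge q \mid u) \times P(u)$$

$$P(d_j \wedge q) = \sum_{u} P(d_j \mid u) \times P(q \mid u) \times P(u)$$

$$P(d_j \wedge q) = \sum_{\vec{k}} P(d_j \mid \vec{k}) \times P(q \mid \vec{k}) \times P(\vec{k})$$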

Belief Network Model (6/6)

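Specify the conditional probabilities as follows:

$$P(q \mid \vec{k}) = \begin{cases} \dfrac{w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,q}^2}} & \text{if } \vec{k} = \vec{k}_i \wedge g_i(\vec{q}) = 1 \\ 0 & \text{otherwise} \end{cases}$$

$$P(d_j \mid \vec{k}) = \begin{cases} \dfrac{w_{i,j}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2}} & \text{if } \vec{k} = \vec{k}_i \wedge g_i(\vec{d}_j) = 1 \\ 0 & \text{otherwise} \end{cases}$$

Thus, the belief network model can be tuned to subsume the vector model.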
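That these conditional probabilities reproduce the vector ranking can be checked numerically. This is a minimal sketch, assuming non-negative weights and the uniform prior $P(\vec{k}) = (1/2)^t$; the names `unit` and `belief_rank` are hypothetical. Only the single-term states $\vec{k}_i$ contribute, since both conditionals are zero elsewhere.

```python
import math

def unit(weights):
    """Divide each weight by the Euclidean norm of the vector (zeros stay zero)."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]

def belief_rank(d, q):
    """Sum over single-term states k_i of P(d_j|k_i) * P(q|k_i) * P(k_i)."""
    t = len(d)
    p_k = 0.5 ** t   # uniform prior over states, as assumed above
    p_d = unit(d)    # P(d_j | k_i) = w_ij / |d_j|
    p_q = unit(q)    # P(q | k_i)   = w_iq / |q|
    return sum(pd * pq * p_k for pd, pq in zip(p_d, p_q))

# For d = (1, 1) and q = (1, 0) the cosine is 1/sqrt(2); the belief score is that times (1/2)^2.
score = belief_rank([1, 1], [1, 0])
```

Since the prior is the same constant for every document, dividing it out leaves exactly the cosine formula, so both models produce the same ordering.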

Comparison

Belief network model

  The belief network model is based on a set-theoretic view.

  The belief network model provides a separation between the document and the query.

  The belief network model is able to reproduce any ranking strategy generated by the inference network model.

Inference network model