Chapter 2 Modeling
Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto
Introduction
Traditional information retrieval systems usually adopt index terms to index and retrieve documents. An index term is a keyword (or group of related words) which has some meaning of its own (usually a noun). Advantage:
Simple: the semantics of the documents and of the user information need can be naturally expressed through sets of index terms.
IR Models
Ranking algorithms are at the core of information retrieval systems (predicting which documents are relevant and which are not).
A taxonomy of information retrieval models

[Figure: taxonomy of IR models by user task. For the retrieval task (ad hoc and filtering): Classic models (Boolean, Vector, Probabilistic), Set Theoretic models (Fuzzy, Extended Boolean), Algebraic models (Generalized Vector, Latent Semantic Indexing, Neural Networks), Probabilistic models (Inference Network, Belief Network), and Structured models (Non-overlapping Lists, Proximal Nodes). For the browsing task: Flat, Structure Guided, and Hypertext models.]

Figure 2.2 Retrieval models most frequently associated with distinct combinations of a document logical view and a user task:

Task        | Index Terms                                      | Full Text                                        | Full Text + Structure
Retrieval   | Classic, Set Theoretic, Algebraic, Probabilistic | Classic, Set Theoretic, Algebraic, Probabilistic | Structured
Browsing    | Flat                                             | Flat, Hypertext                                  | Structure Guided, Hypertext
Retrieval : Ad hoc and Filtering
Ad hoc (Search): The documents in the collection remain relatively static while new queries are submitted to the system.
Routing (Filtering): The queries remain relatively static while new documents come into the system.
A formal characterization of IR models
D: A set composed of logical views (or representations) for the documents in the collection.
Q: A set composed of logical views (or representations) for the user information needs (queries).
F: A framework for modeling document representations, queries, and their relationships.
R(qi, dj): A ranking function which associates a real number with a query qi and a document dj, defining an ordering among the documents with regard to the query qi.
Define
ki: a generic index term
K: the set of all index terms {k1, …, kt}
wi,j: a weight associated with index term ki of a document dj
gi: a function that returns the weight associated with ki in any t-dimensional vector, i.e. $g_i(\vec{d}_j) = w_{i,j}$
Classic IR Model
Basic concepts: Each document is described by a set of representative keywords called index terms.
Since not all index terms are equally relevant, numerical weights are assigned to them to capture their distinct importance for describing each document.
Boolean model
Binary decision criterion: a document is predicted to be either relevant or non-relevant, with no notion of partial matching (closer to a data retrieval model).
Advantage: clean formalism, simplicity.
Disadvantages:
It is not simple to translate an information need into a Boolean expression.
Exact matching may lead to retrieval of too few or too many documents.
Example: the query $q = k_a \wedge (k_b \vee \neg k_c)$ can be represented as a disjunction of conjunctive vectors (in DNF):
$$\vec{q}_{dnf} = (1,1,1) \vee (1,1,0) \vee (1,0,0)$$
where each binary triple lists the weights of $(k_a, k_b, k_c)$.
Formal definition: for the Boolean model, the index term weights are all binary, i.e. $w_{i,j} \in \{0,1\}$. A query q is a conventional Boolean expression, which can be transformed into a disjunctive normal form $\vec{q}_{dnf}$ whose conjunctive components are denoted $\vec{q}_{cc}$. Then
$$sim(d_j, q) = \begin{cases} 1 & \text{if } \exists\, \vec{q}_{cc} \in \vec{q}_{dnf} \text{ such that } \forall k_i,\ g_i(\vec{d}_j) = g_i(\vec{q}_{cc}) \\ 0 & \text{otherwise} \end{cases}$$
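A minimal Python sketch of this matching rule over a toy three-term vocabulary; all data and names here are illustrative, not from the book:

```python
# Each document is a binary vector over the index terms (ka, kb, kc).
docs = {
    "d1": (1, 1, 1),
    "d2": (1, 1, 0),
    "d3": (0, 1, 0),
}

# q = ka AND (kb OR NOT kc), written in DNF as its conjunctive components:
q_dnf = [(1, 1, 1), (1, 1, 0), (1, 0, 0)]

def boolean_sim(doc_vec, dnf):
    """sim(dj, q) = 1 if some conjunctive component matches dj exactly."""
    return 1 if any(cc == doc_vec for cc in dnf) else 0

for name, vec in docs.items():
    print(name, boolean_sim(vec, q_dnf))   # d1 -> 1, d2 -> 1, d3 -> 0
```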
Vector model
Assign non-binary weights to index terms in queries and in documents. => TFxIDF
Compute the similarity between documents and query. => Sim(Dj, Q)
More precise than the Boolean model.
The IR problem as a clustering problem
We think of the documents as a collection C of objects and think of the user query as a specification of a set A of objects.
Intra-cluster: which features better describe the objects in the set A?
Inter-cluster: which features better distinguish the objects in the set A from the remaining objects in C?
TF: intra-cluster similarity is quantified by measuring the raw frequency of a term ki inside a document dj. Such term frequency is usually referred to as the tf factor and provides one measure of how well that term describes the document contents.
IDF: inter-cluster dissimilarity is quantified by measuring the inverse of the frequency of a term ki among the documents in the collection. This frequency is often referred to as the inverse document frequency.
Idea for TFxIDF
Vector Model (1/4)
Index terms are assigned positive and non-binary weights.
The index terms in the query are also weighted.
Term weights are used to compute the degree of similarity between documents and the user query. Then, retrieved documents are sorted in decreasing order.
$$\vec{d}_j = (w_{1,j}, w_{2,j}, \dots, w_{t,j})$$
$$\vec{q} = (w_{1,q}, w_{2,q}, \dots, w_{t,q})$$
Vector Model (2/4)
Degree of similarity:
$$sim(d_j, q) = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j| \times |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$$

[Figure 2.4: The cosine of θ, the angle between $\vec{d}_j$ and $\vec{q}$, is adopted as sim(dj, q).]
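A small sketch of this cosine formula in Python; NumPy is assumed to be available and the weight vectors are invented toy values:

```python
import numpy as np

def cosine_sim(d, q):
    """sim(dj, q) = (dj . q) / (|dj| |q|)."""
    denom = np.linalg.norm(d) * np.linalg.norm(q)
    return float(d @ q / denom) if denom else 0.0

d_j = np.array([0.5, 0.8, 0.0])   # (w_{1,j}, w_{2,j}, w_{3,j})
q   = np.array([0.4, 0.6, 0.1])   # (w_{1,q}, w_{2,q}, w_{3,q})
print(cosine_sim(d_j, q))          # ~0.99
```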
Vector Model (3/4) Definition
Normalized frequency:
$$f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}}$$
Inverse document frequency:
$$idf_i = \log \frac{N}{n_i}$$
Term-weighting scheme:
$$w_{i,j} = f_{i,j} \times \log \frac{N}{n_i}$$
Query-term weights:
$$w_{i,q} = \left(0.5 + \frac{0.5\, freq_{i,q}}{\max_l freq_{l,q}}\right) \times \log \frac{N}{n_i}$$
where $freq_{i,j}$ is the raw frequency of ki in dj, N is the total number of documents, and $n_i$ is the number of documents in which ki appears.
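A sketch of these weighting formulas in Python; the frequencies and collection statistics below are made up for illustration:

```python
import math

def doc_weight(freq_ij, max_freq_j, N, n_i):
    f_ij = freq_ij / max_freq_j          # normalized frequency f_{i,j}
    return f_ij * math.log(N / n_i)      # w_{i,j} = f_{i,j} * idf_i

def query_weight(freq_iq, max_freq_q, N, n_i):
    return (0.5 + 0.5 * freq_iq / max_freq_q) * math.log(N / n_i)

# Term ki occurs 3 times in dj (the most frequent term occurs 5 times),
# appears in 100 of N = 10000 documents, and occurs twice in the query
# (whose most frequent term also occurs twice).
print(doc_weight(3, 5, 10000, 100))    # 0.6 * ln(100) ~ 2.76
print(query_weight(2, 2, 10000, 100))  # 1.0 * ln(100) ~ 4.61
```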
Vector Model (4/4)
Advantages:
Its term-weighting scheme improves retrieval performance.
Its partial matching strategy allows retrieval of documents that approximate the query conditions.
Its cosine ranking formula sorts the documents according to their degree of similarity to the query.
Disadvantage:
The assumption of mutual independence between index terms.
Example: consider three items represented in two different bases:
v1: (1,0) → (1,0); v2: (1,1) → (0,1); v3: (0,1) → (-1,1)
Under the first (orthogonal) basis: cos(v1,v2) = 1/√2, cos(v2,v3) = 1/√2, cos(v1,v3) = 0.
Under the second basis: cos(v1,v2) = 0, cos(v2,v3) = 1/√2, cos(v1,v3) = -1/√2.
The similarities change with the choice of basis, so the assumed orthogonality (independence) of index terms matters.
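A quick numeric check of the cosines in this example (NumPy assumed):

```python
import numpy as np

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

basis1 = [np.array([1., 0.]), np.array([1., 1.]), np.array([0., 1.])]
basis2 = [np.array([1., 0.]), np.array([0., 1.]), np.array([-1., 1.])]

for v in (basis1, basis2):
    print([round(float(cos(v[i], v[j])), 3)
           for i, j in ((0, 1), (1, 2), (0, 2))])
# basis1: [0.707, 0.707, 0.0]      i.e. (1/sqrt(2), 1/sqrt(2), 0)
# basis2: [0.0, 0.707, -0.707]     i.e. (0, 1/sqrt(2), -1/sqrt(2))
```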
Probabilistic Model (1/6)
Introduced by Robertson and Sparck Jones, 1976. Also called the binary independence retrieval (BIR) model.
Idea: Given a user query q and the ideal answer set of the relevant documents, the problem is to specify the properties of this set; i.e., the probabilistic model tries to estimate the probability that the user will find the document dj relevant, via the ratio P(dj relevant to q) / P(dj non-relevant to q).
Probabilistic Model (2/6)
Definitions:
All index term weights are binary, i.e. $w_{i,j} \in \{0, 1\}$.
Let R be the set of documents known to be relevant to query q.
Let $\bar{R}$ be the complement of R.
Let $P(R \mid \vec{d}_j)$ be the probability that the document dj is relevant to the query q.
Let $P(\bar{R} \mid \vec{d}_j)$ be the probability that the document dj is non-relevant to the query q.
Probabilistic Model (3/6)
The similarity sim(dj, q) of the document dj to the query q is defined as the ratio
$$sim(d_j, q) = \frac{P(R \mid \vec{d}_j)}{P(\bar{R} \mid \vec{d}_j)}$$
Using Bayes' rule,
$$sim(d_j, q) = \frac{P(\vec{d}_j \mid R)\, P(R)}{P(\vec{d}_j \mid \bar{R})\, P(\bar{R})}$$
P(R) stands for the probability that a document randomly selected from the entire collection is relevant.
$P(\vec{d}_j \mid R)$ stands for the probability of randomly selecting the document dj from the set R of relevant documents.
Probabilistic Model (4/6)
Assuming independence of index terms,
$$sim(d_j, q) = \frac{\left(\prod_{k_i \in \vec{d}_j} P(k_i \mid R)\right)\left(\prod_{k_i \notin \vec{d}_j} P(\bar{k}_i \mid R)\right)}{\left(\prod_{k_i \in \vec{d}_j} P(k_i \mid \bar{R})\right)\left(\prod_{k_i \notin \vec{d}_j} P(\bar{k}_i \mid \bar{R})\right)} \cdot \frac{P(R)}{P(\bar{R})}$$
Taking the logarithm and dropping $\log \frac{P(R)}{P(\bar{R})}$, which is constant for all documents:
$$sim(d_j, q) \sim \log \frac{\prod_{k_i \in \vec{d}_j} P(k_i \mid R) \prod_{k_i \notin \vec{d}_j} P(\bar{k}_i \mid R)}{\prod_{k_i \in \vec{d}_j} P(k_i \mid \bar{R}) \prod_{k_i \notin \vec{d}_j} P(\bar{k}_i \mid \bar{R})}$$
Probabilistic Model (5/6)
$P(k_i \mid R)$ stands for the probability that the index term ki is present in a document randomly selected from the set R.
$P(\bar{k}_i \mid R)$ stands for the probability that the index term ki is not present in a document randomly selected from the set R.
Probabilistic Model (6/6)
Splitting the products by the value of $g_i(\vec{d}_j)$ and using $P(k_i \mid R) + P(\bar{k}_i \mid R) = 1$:
$$sim(d_j, q) \sim \log \frac{\prod_{g_i(\vec{d}_j)=1} P(k_i \mid R) \prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i \mid R)}{\prod_{g_i(\vec{d}_j)=1} P(k_i \mid \bar{R}) \prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i \mid \bar{R})}$$
which yields the key ranking formula of the probabilistic model:
$$sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q}\, w_{i,j} \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)$$
Estimation of Term Relevance
In the very beginning (before any documents are retrieved):
$$P(k_i \mid R) = 0.5, \qquad P(k_i \mid \bar{R}) = \frac{df_i}{N}$$
where $df_i$ is the number of documents containing ki and N is the collection size.

Next, let V be a subset of the documents initially retrieved and ranked, and $V_i$ the subset of V containing ki. The ranking can then be improved as follows:
$$P(k_i \mid R) = \frac{V_i}{V}, \qquad P(k_i \mid \bar{R}) = \frac{df_i - V_i}{N - V}$$

For small values of V, adjustment factors are added to avoid degenerate estimates:
$$P(k_i \mid R) = \frac{V_i + 0.5}{V + 1}, \qquad P(k_i \mid \bar{R}) = \frac{df_i - V_i + 0.5}{N - V + 1}$$
or, using $df_i/N$ in place of the constant 0.5:
$$P(k_i \mid R) = \frac{V_i + df_i/N}{V + 1}, \qquad P(k_i \mid \bar{R}) = \frac{df_i - V_i + df_i/N}{N - V + 1}$$
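A sketch of the BIR ranking formula with the initial estimates above; the collection statistics and binary vectors are toy values, not from the book:

```python
import math

N = 1000                       # collection size
df = [50, 300, 10]             # document frequency of each term ki
p_rel = [0.5] * 3              # initial P(ki | R)
p_non = [d / N for d in df]    # initial P(ki | R_bar) = df_i / N

def bir_score(doc_vec, query_vec):
    """Sum of log-odds weights over query terms present in the document."""
    s = 0.0
    for i, (dj, qi) in enumerate(zip(doc_vec, query_vec)):
        if dj and qi:
            s += math.log(p_rel[i] / (1 - p_rel[i]))
            s += math.log((1 - p_non[i]) / p_non[i])
    return s

print(bir_score((1, 0, 1), (1, 1, 1)))   # log(19) + log(99) ~ 7.54
```

After an initial retrieval, p_rel and p_non would be re-estimated from V and Vi as in the formulas above.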
Alternative Set Theoretic Models
Fuzzy Set Model
Extended Boolean Model
Fuzzy Theory
A fuzzy subset A of a universe U is characterized by a membership function $\mu_A: U \to [0, 1]$ which associates with each element u of U a number $\mu_A(u)$ in [0, 1].
Let A and B be two fuzzy subsets of U. Then:
$$\mu_{\bar{A}}(u) = 1 - \mu_A(u)$$
$$\mu_{A \cup B}(u) = \max(\mu_A(u), \mu_B(u))$$
$$\mu_{A \cap B}(u) = \min(\mu_A(u), \mu_B(u))$$
Fuzzy Information Retrieval
Using a term-term correlation matrix:
$$c_{i,l} = \frac{df_{i,l}}{df_i + df_l - df_{i,l}}$$
where $df_i$ ($df_l$) is the number of documents containing term ki (kl) and $df_{i,l}$ is the number of documents containing both.
Define a fuzzy set associated to each index term ki, with membership function
$$\mu_i(d_j) = 1 - \prod_{k_l \in d_j} (1 - c_{i,l})$$
If some term kl of dj is strongly related to ki, i.e. $c_{i,l} \approx 1$, then $\mu_i(d_j) \approx 1$.
If all terms of dj are loosely related to ki, i.e. $c_{i,l} \approx 0$, then $\mu_i(d_j) \approx 0$.
Example
Query $q = k_a \wedge (k_b \vee \neg k_c)$; in disjunctive normal form:
$$\vec{q}_{dnf} = (k_a \wedge k_b \wedge k_c) \vee (k_a \wedge k_b \wedge \bar{k}_c) \vee (k_a \wedge \bar{k}_b \wedge \bar{k}_c) = cc_1 \vee cc_2 \vee cc_3$$
The degree of membership of dj in the query fuzzy set is computed with algebraic sums and products:
$$\mu_q(d_j) = \mu_{cc_1 \vee cc_2 \vee cc_3}(d_j) = 1 - \prod_{i=1}^{3} \left(1 - \mu_{cc_i}(d_j)\right)$$
$$= 1 - (1 - \mu_{a,j}\,\mu_{b,j}\,\mu_{c,j}) \times (1 - \mu_{a,j}\,\mu_{b,j}\,(1 - \mu_{c,j})) \times (1 - \mu_{a,j}\,(1 - \mu_{b,j})\,(1 - \mu_{c,j}))$$
Algebraic Sum and Product
The degree of membership in a disjunctive fuzzy set is computed using an algebraic sum, instead of the max function.
The degree of membership in a conjunctive fuzzy set is computed using an algebraic product, instead of the min function.
These are smoother than the max and min functions.
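A sketch of the fuzzy membership computation in Python; all document-frequency counts below are invented toy values:

```python
# df[i]: number of docs containing ki; df_pair: docs containing both terms
df = {"a": 40, "b": 30, "c": 20}
df_pair = {("a", "b"): 15, ("a", "c"): 5, ("b", "c"): 10}

def corr(i, l):
    """c_{i,l} = df_{i,l} / (df_i + df_l - df_{i,l}); c_{i,i} = 1."""
    if i == l:
        return 1.0
    n = df_pair.get((i, l)) or df_pair.get((l, i))
    return n / (df[i] + df[l] - n)

def membership(i, doc_terms):
    """mu_i(dj) = 1 - prod over kl in dj of (1 - c_{i,l})."""
    p = 1.0
    for l in doc_terms:
        p *= 1.0 - corr(i, l)
    return 1.0 - p

print(membership("a", ["b", "c"]))   # dj indexed by kb and kc, ~0.34
```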
Alternative Algebraic Models
Generalized Vector Space Model
Latent Semantic Indexing Model
Latent Semantic Indexing (1/5)
Let A be a term-document association matrix with m rows and n columns.
Latent semantic indexing decomposes A using the singular value decomposition:
$$A = U \Sigma V^T$$
U (m×m) is the matrix of eigenvectors derived from the term-to-term correlation matrix ($AA^T$).
V (n×n) is the matrix of eigenvectors derived from the document-to-document matrix ($A^TA$).
$\Sigma$ is an m×n diagonal matrix of singular values, where $r \le \min(t, N)$ is the rank of A.
Latent Semantic Indexing (2/5)
Consider now only the s largest singular values of $\Sigma$, together with their corresponding columns in U and V (the remaining singular values of $\Sigma$ are deleted):
$$A_s = U_s \Sigma_s V_s^T$$
The resultant matrix $A_s$ (rank s) is closest to the original matrix A in the least-squares sense.
s < r is the dimensionality of a reduced concept space.
Latent Semantic Indexing (3/5)
The selection of s attempts to balance two opposing effects:
s should be large enough to allow fitting all the structure in the real data.
s should be small enough to allow filtering out the non-relevant representational details.
$U_s = \{u_1, u_2, \dots, u_s\}$ are the s principal components of the column space (document space) in $R^m$.
$V_s = \{v_1, v_2, \dots, v_s\}$ are the s principal components of the row space (term space) in $R^n$.
Latent Semantic Indexing (4/5)
Consider the relationship between any two documents:
$$A_s^T A_s = (U_s \Sigma_s V_s^T)^T (U_s \Sigma_s V_s^T) = V_s \Sigma_s U_s^T U_s \Sigma_s V_s^T = V_s \Sigma_s \Sigma_s V_s^T = (V_s \Sigma_s)(V_s \Sigma_s)^T$$
$U_s^T \vec{d}_i$ is the projected vector for document $d_i$ ($R^m \to R^s$).
$V_s^T \vec{t}_i$ is the projected vector for term vector $t_i$ ($R^n \to R^s$).
Latent Semantic Indexing (5/5)
To rank documents with regard to a given user query, we model the query as a pseudo-document in the original matrix A. Assume the query is modeled as the document with number k. Then the kth row in the matrix $A_s^T A_s$ provides the ranks of all documents with respect to this query.

Speedup: the matrix-vector multiplication $A^T \vec{q}$ requires a total of Nt scalar multiplications, while
$$A_s^T \vec{q} = (V_s \Sigma_s)(U_s^T \vec{q})$$
requires only (n + m)s scalar multiplications.
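A sketch of these steps with NumPy: SVD of a toy term-document matrix, rank-s truncation, and query scoring in the cheaper factored form; all matrix values are invented for illustration:

```python
import numpy as np

A = np.array([[2., 0., 1.],     # term-document matrix (m terms x n docs)
              [0., 3., 1.],
              [1., 1., 0.],
              [0., 2., 2.]])
U, sing, Vt = np.linalg.svd(A, full_matrices=False)

s = 2                                   # reduced concept-space dimension
Us, Ss, Vts = U[:, :s], np.diag(sing[:s]), Vt[:s, :]
As = Us @ Ss @ Vts                      # rank-s approximation of A

q = np.array([1., 0., 1., 0.])          # query as a term vector
q_s = Us.T @ q                          # project query: R^m -> R^s
scores = (Vts.T @ Ss) @ q_s             # As^T q via (n+m)s multiplications
print(scores)                           # one score per document
```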
Alternative Probabilistic Model
Bayesian Networks
Inference Network Model
Belief Network Model
Bayesian Network
Let $x_i$ be a node in a Bayesian network G and $\Gamma_{x_i}$ be the set of parent nodes of $x_i$.
The influence of $\Gamma_{x_i}$ on $x_i$ can be specified by any set of functions $F_i(x_i, \Gamma_{x_i})$ that satisfy:
$$0 \le F_i(x_i, \Gamma_{x_i}) \le 1, \qquad \sum_{x_i} F_i(x_i, \Gamma_{x_i}) = 1$$
Example:
$$P(x_1, x_2, x_3, x_4, x_5) = P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_1)\,P(x_4 \mid x_2, x_3)\,P(x_5 \mid x_3)$$
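A sketch of this factorization in Python for the five-node example; every probability table entry below is a made-up number:

```python
p_x1 = 0.6                              # P(x1 = 1)
p_x2 = {1: 0.7, 0: 0.2}                 # P(x2 = 1 | x1)
p_x3 = {1: 0.5, 0: 0.1}                 # P(x3 = 1 | x1)
p_x4 = {(1, 1): 0.9, (1, 0): 0.5,
        (0, 1): 0.4, (0, 0): 0.05}      # P(x4 = 1 | x2, x3)
p_x5 = {1: 0.8, 0: 0.3}                 # P(x5 = 1 | x3)

def bern(p, value):
    """P(X = value) for a binary variable with P(X = 1) = p."""
    return p if value == 1 else 1.0 - p

def joint(x1, x2, x3, x4, x5):
    """P(x1..x5) = P(x1) P(x2|x1) P(x3|x1) P(x4|x2,x3) P(x5|x3)."""
    return (bern(p_x1, x1) * bern(p_x2[x1], x2) * bern(p_x3[x1], x3)
            * bern(p_x4[(x2, x3)], x4) * bern(p_x5[x3], x5))

print(joint(1, 1, 0, 0, 1))   # 0.6 * 0.7 * 0.5 * 0.5 * 0.3 = 0.0315
```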
Belief Network Model (1/6)
The probability space: the set K = {k1, k2, …, kt} of all index terms is the universe. To each subset u of K is associated a vector $\vec{k}_u$ such that $g_i(\vec{k}_u) = 1 \Leftrightarrow k_i \in u$.
Random variables: to each index term ki is associated a binary random variable.
Belief Network Model (2/6)
Concept space: a document dj is represented as a concept composed of the terms used to index dj. A user query q is also represented as a concept composed of the terms used to index q. Both user query and document are modeled as subsets of index terms.
Probability distribution P over K:
$$P(c) = \sum_{u} P(c \mid u)\, P(u), \qquad P(u) = \left(\frac{1}{2}\right)^t$$
Belief Network Model (3/6)
A query q is modeled as a network node, a binary random variable that is set to 1 whenever q completely covers the concept space K; P(q) computes the degree of coverage of the space K by q.
A document dj is likewise modeled as a network node; its random variable is 1 to indicate that dj completely covers the concept space K, and P(dj) computes the degree of coverage of the space K by dj.
Belief Network Model (4/6)
Belief Network Model (5/6)
Assumption: P(dj | q) is adopted as the rank of the document dj with respect to the query q.
$$P(d_j \mid q) = \frac{P(d_j \wedge q)}{P(q)} \sim P(d_j \wedge q)$$
since P(q) is a constant for all documents. Then
$$P(d_j \wedge q) = \sum_{u} P(d_j \wedge q \mid u)\, P(u) = \sum_{u} P(d_j \mid u)\, P(q \mid u)\, P(u) = \sum_{\vec{k}} P(d_j \mid \vec{k})\, P(q \mid \vec{k})\, P(\vec{k})$$
Belief Network Model (6/6)
Specify the conditional probabilities as follows:
$$P(d_j \mid \vec{k}) = \begin{cases} \dfrac{w_{i,j}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2}} & \text{if } \vec{k} = \vec{k}_i \wedge g_i(\vec{d}_j) = 1 \\ 0 & \text{otherwise} \end{cases}$$
$$P(q \mid \vec{k}) = \begin{cases} \dfrac{w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,q}^2}} & \text{if } \vec{k} = \vec{k}_i \wedge g_i(\vec{q}) = 1 \\ 0 & \text{otherwise} \end{cases}$$
Thus, the belief network model can be tuned to subsume the vector model.
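A sketch showing how these conditionals reduce the ranking to the cosine measure; the weights are toy values, and the constant factor $P(\vec{k})$ is dropped:

```python
import numpy as np

w_dj = np.array([0.5, 0.8, 0.0])   # w_{i,j}
w_q  = np.array([0.4, 0.6, 0.1])   # w_{i,q}

# Only single-term states k = k_i contribute; terms with g_i = 0 give 0.
p_d_given_k = w_dj / np.linalg.norm(w_dj)   # P(dj | ki)
p_q_given_k = w_q / np.linalg.norm(w_q)     # P(q | ki)

rank = float(p_d_given_k @ p_q_given_k)     # sum_i P(dj|ki) P(q|ki)
print(rank)                                 # equals cosine sim(dj, q), ~0.99
```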
Comparison
Belief network model:
is based on a set-theoretic view
provides a separation between the document and the query
is able to reproduce any ranking strategy generated by the inference network model
Inference network model: