2011.02.07 - SLIDE 1IS 240 – Spring 2011
Prof. Ray Larson University of California, Berkeley
School of Information
Principles of Information Retrieval
Lecture 6: Boolean to Vector
2011.02.07 - SLIDE 2IS 240 – Spring 2011
Today
• Review
  – IR Models
  – The Boolean Model
  – Boolean implementation issues
  – Extended Boolean Approaches
• Vector Representation
• Term Weights
• Vector Matching
2011.02.07 - SLIDE 4IS 240 – Spring 2011
IR Models
• Set Theoretic Models
  – Boolean
  – Fuzzy
  – Extended Boolean
• Vector Models (Algebraic)
• Probabilistic Models (probabilistic)
2011.02.07 - SLIDE 5IS 240 – Spring 2011
Boolean Logic
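[Figure: Venn diagram of two overlapping sets, A and B]

$C = A$
$C = \overline{A}$
$C = A \cap B$
$C = A \cup B$

De Morgan's Law:
$\overline{A \cap B} = \overline{A} \cup \overline{B}$
$\overline{A \cup B} = \overline{A} \cap \overline{B}$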
2011.02.07 - SLIDE 6IS 240 – Spring 2011
Parse Result (Query Tree)
• Z39.50 queries…
Query: Title XXX and Subject YYY

Oper: AND
  left:  Operand: Index = Title, Value = XXX
  right: Operand: Index = Subject, Value = YYY
2011.02.07 - SLIDE 7IS 240 – Spring 2011
Parse Results
• Subject XXX and (title yyy and author zzz)
Op: AND
  Oper: Index: Subject, Value: XXX
  Op: AND
    Oper: Index: Title, Value: YYY
    Oper: Index: Author, Value: ZZZ
2011.02.07 - SLIDE 8IS 240 – Spring 2011
Boolean AND Algorithm
List A: 2 5 7 8 15 29 35 100 135 140 155 189 190 195 198
List B: 2 8 9 12 15 22 28 50 68 77 84 100 120 128 135 138 141 150 155 188 189 195
A AND B = 2 8 15 100 135 155 189 195
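The AND of two postings lists can be sketched as a merge of two sorted lists. This is a minimal illustration, assuming both lists are kept sorted by document ID; the function name `intersect` is hypothetical and the slide does not prescribe a particular implementation.

```python
def intersect(a, b):
    """AND of two sorted postings lists using a two-pointer merge."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:        # document ID is in both lists: keep it
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:       # advance whichever list is behind
            i += 1
        else:
            j += 1
    return out


# The two postings lists shown on the slide.
A = [2, 5, 7, 8, 15, 29, 35, 100, 135, 140, 155, 189, 190, 195, 198]
B = [2, 8, 9, 12, 15, 22, 28, 50, 68, 77, 84, 100, 120, 128,
     135, 138, 141, 150, 155, 188, 189, 195]

print(intersect(A, B))
```

Because each pointer only moves forward, the merge touches every posting at most once.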
2011.02.07 - SLIDE 9IS 240 – Spring 2011
Boolean OR Algorithm
List A: 2 5 7 8 15 29 35 100 135 140 155 189 190 195 198
List B: 2 8 9 12 15 22 28 50 68 77 84 100 120 128 135 138 141 150 155 188 189 195
A OR B = 2 5 7 8 9 12 15 22 28 29 35 50 68 77 84 100 120 128 135 138 140 141 150 155 188 189 190 195 198
2011.02.07 - SLIDE 10IS 240 – Spring 2011
Boolean AND NOT Algorithm
List A: 2 5 7 8 15 29 35 100 135 140 155 189 190 195 198
List B: 2 8 9 12 15 22 28 50 68 77 84 100 120 128 135 138 141 150 155 188 189 195
A AND NOT B = 5 7 29 35 140 190 198
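OR and AND NOT can be sketched with the same kind of sorted-list merge. This is an illustration only, assuming sorted postings lists; the function names `union` and `and_not` are hypothetical.

```python
def union(a, b):
    """OR of two sorted postings lists: every ID that is in either list."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])           # whatever is left in either list
    out.extend(b[j:])
    return out


def and_not(a, b):
    """AND NOT of two sorted postings lists: IDs in a that are not in b."""
    i, j, out = 0, 0, []
    while i < len(a):
        if j == len(b) or a[i] < b[j]:
            out.append(a[i])    # a[i] cannot occur in the rest of b
            i += 1
        elif a[i] == b[j]:
            i += 1              # excluded because it is in b
            j += 1
        else:
            j += 1
    return out


A = [2, 5, 7, 8, 15, 29, 35, 100, 135, 140, 155, 189, 190, 195, 198]
B = [2, 8, 9, 12, 15, 22, 28, 50, 68, 77, 84, 100, 120, 128,
     135, 138, 141, 150, 155, 188, 189, 195]

print(union(A, B))
print(and_not(A, B))
```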
2011.02.07 - SLIDE 11IS 240 – Spring 2011
Boolean Summary
• Advantages
  – simple queries are easy to understand
  – relatively easy to implement
• Disadvantages
  – difficult to specify what is wanted, particularly in complex situations
  – too much returned, or too little
  – ordering not well determined
• Dominant IR model in commercial systems until the WWW
2011.02.07 - SLIDE 12IS 240 – Spring 2011
Basic Concepts for Extended Boolean
• Instead of binary values, terms in documents and queries have a weight (importance or some other statistical property)
• Instead of binary set membership, sets are “fuzzy” and the weights are used to determine degree of membership.
• Degree of set membership can be used to rank the results of a query
2011.02.07 - SLIDE 13IS 240 – Spring 2011
Fuzzy Sets
• Introduced by Zadeh in 1965.
• If set {A} has value v(A) and {B} has value v(B), where 0 ≤ v ≤ 1
• v(A ∩ B) = min(v(A), v(B))
• v(A ∪ B) = max(v(A), v(B))
• v(~A) = 1 − v(A)
2011.02.07 - SLIDE 14IS 240 – Spring 2011
Fuzzy Sets
• If we have three documents and three terms…
  – D1 = (.4, .2, 1), D2 = (0, 0, .8), D3 = (.7, .4, 0)

For search: t1 ∨ t2 ∨ t3
  v(D1) = max(.4, .2, 1) = 1
  v(D2) = max(0, 0, .8) = .8
  v(D3) = max(.7, .4, 0) = .7

For search: t1 ∧ t2 ∧ t3
  v(D1) = min(.4, .2, 1) = .2
  v(D2) = min(0, 0, .8) = 0
  v(D3) = min(.7, .4, 0) = 0
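The two evaluations above can be reproduced in a few lines. This is a minimal sketch, assuming the first search is the OR of the three terms and the second is the AND (inferred from the use of max and min); the helper names are hypothetical.

```python
# Fuzzy membership weights of terms t1, t2, t3 in each document (from the slide).
DOCS = {
    "D1": (0.4, 0.2, 1.0),
    "D2": (0.0, 0.0, 0.8),
    "D3": (0.7, 0.4, 0.0),
}


def fuzzy_or(weights):
    """Fuzzy union: the largest membership value."""
    return max(weights)


def fuzzy_and(weights):
    """Fuzzy intersection: the smallest membership value."""
    return min(weights)


def fuzzy_not(weight):
    """Fuzzy complement."""
    return 1.0 - weight


for name, weights in DOCS.items():
    print(name, "OR:", fuzzy_or(weights), "AND:", fuzzy_and(weights))
```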
2011.02.07 - SLIDE 15IS 240 – Spring 2011
Fuzzy Sets
• Fuzzy set membership of term to document is f(A) ∈ [0,1]
• D1 = {(mesons, .8), (scattering, .4)}
• D2 = {(mesons, .5), (scattering, .6)}
• Query = MESONS AND SCATTERING
• RSV(D1) = MIN(.8,.4) = .4
• RSV(D2) = MIN(.5,.6) = .5
• D2 is ranked before D1 in the result set.
2011.02.07 - SLIDE 16IS 240 – Spring 2011
Fuzzy Sets
• The set membership function can be, for example, the relative term frequency within a document, the IDF or any other function providing weights to terms
• This means that the fuzzy methods use sets of criteria for term weighting that are the same or similar to those used in other ranked retrieval methods (e.g., vector and probabilistic methods)
2011.02.07 - SLIDE 17IS 240 – Spring 2011
Robertson’s Critique of Fuzzy Sets
• D1 = {(mesons, .4), (scattering, .4)}
• D2 = {(mesons, .39), (scattering, .99)}
• Query = MESONS AND SCATTERING
• RSV(D1) = MIN(.4,.4) = .4
• RSV(D2) = MIN(.39,.99) = .39
• However, consistent with the Boolean model:
  – Query = t1 ∧ t2 ∧ t3 ∧ … ∧ t100
  – If D is not indexed by t1 then it fails, even if D is indexed by t2,…,t100
2011.02.07 - SLIDE 18IS 240 – Spring 2011
Robertson’s critique of Fuzzy
• Fuzzy sets suffer from the same lack of discrimination among retrieval results, almost to the same extent as standard Boolean
• The rank of a document depends entirely on the lowest or highest weighted term in an AND or OR operation
2011.02.07 - SLIDE 19IS 240 – Spring 2011
Other Fuzzy Approaches
• As described in the Modern Information Retrieval (optional) text, a keyword correlation matrix can be used to determine set membership values, and algebraic sums and products can be used in place of MAX and MIN
• Not clear how this approach works in real applications (or in tests like TREC) because the testing has been on a small scale
2011.02.07 - SLIDE 20IS 240 – Spring 2011
Extended Boolean (P-Norm)
• Ed Fox’s Dissertation work with Salton
• Basic notion is that terms in a Boolean query, and the Boolean Operators themselves, can have weights assigned to them
• Binary weights mean that queries behave like standard Boolean
• 0 < Weights < 1 mean that queries behave like a ranking system
• The system requires similarity measures
2011.02.07 - SLIDE 21IS 240 – Spring 2011
Probabilistic Inclusion of Boolean
• Most probabilistic models attempt to predict the probability that, given a particular query Q and document D, the searcher would find D relevant
• If we assume that Boolean criteria are to be ANDed with a probabilistic query…
$$P(R \mid Q, D) = P(R \mid Q_{bool}, D)\, P(R \mid Q_{prob}, D)$$

$$P(R \mid Q_{bool}, D) = \begin{cases} 1 & \text{if Boolean eval successful for } D \\ 0 & \text{otherwise} \end{cases}$$
2011.02.07 - SLIDE 22IS 240 – Spring 2011
Rubric – Extended Boolean
• Scans full text of documents and stores them
• User develops a hierarchy of concepts which becomes the query
• Leaf nodes of the hierarchy are combinations of text patterns
• A “fuzzy calculus” is used to propagate values obtained at leaves up through the hierarchy to obtain a single retrieval status value (or “relevance” value)
• RUBRIC returns a ranked list of documents in descending order of “relevance” values.
2011.02.07 - SLIDE 23IS 240 – Spring 2011
RUBRIC Rules for Concepts & Weights
• Team | event => World_Series
• St._Louis_Cardinals | Milwaukee_Brewers => Team
• “Cardinals” => St._Louis_Cardinals (0.7)
• Cardinals_full_name => St._Louis_Cardinals (0.9)
• Saint & “Louis” & “Cardinals” => Cardinals_full_name
• “St.” => Saint (0.9)
• “Saint” => Saint
• “Brewers” => Milwaukee_Brewers (0.5)
2011.02.07 - SLIDE 24IS 240 – Spring 2011
RUBRIC Rules for Concepts & Weights
• “Milwaukee Brewers” => Milwaukee_Brewers (0.9)
• “World Series” => event
• Baseball_championship => event (0.9)
• Baseball & Championship => Baseball_championship
• “ball” => Baseball (0.5)
• “baseball” => Baseball
• “championship” => Championship (0.7)
2011.02.07 - SLIDE 25IS 240 – Spring 2011
RUBRIC combination methods
V(V1 or V2) = MAX(V1, V2)
V(V1 and V2) = MIN(V1, V2)
i.e., classic fuzzy matching, but with the addition…
V(level n) = Cn * V(level n-1)
2011.02.07 - SLIDE 26IS 240 – Spring 2011
Rule Evaluation Tree
World_Series (0)
  Team (0)
    St._Louis_Cardinals (0)
      “Cardinals” (0) [weight 0.7]
      Cardinals_full_name (0) [weight 0.9]
        Saint (0)
          “St.” (0) [weight 0.9]
          “Saint” (0)
        “Louis” (0)
        “Cardinals” (0)
    Milwaukee_brewers (0)
      “Milwaukee Brewers” (0) [weight 0.9]
      “Brewers” (0) [weight 0.5]
  Event (0)
    “World Series”
    Baseball_championship (0) [weight 0.9]
      Baseball (0)
        “ball” (0) [weight 0.5]
        “baseball” (0)
      Championship (0)
        “championship” (0) [weight 0.7]
2011.02.07 - SLIDE 27IS 240 – Spring 2011
Rule Evaluation Tree
[Rule evaluation tree, same structure and weights as on the previous slide. Leaf values: “ball” (1.0), “baseball” (1.0), “championship” (1.0); all other nodes (0).]
Document containing “ball”, “baseball” & “championship”
2011.02.07 - SLIDE 28IS 240 – Spring 2011
Rule Evaluation Tree
[Rule evaluation tree, same structure and weights as before. Updated values: Baseball (1.0), Championship (0.7); leaves “ball”, “baseball”, “championship” (1.0); all other nodes (0).]
2011.02.07 - SLIDE 29IS 240 – Spring 2011
Rule Evaluation Tree
[Rule evaluation tree, same structure and weights as before. Updated values: Baseball_championship (0.7); Baseball (1.0), Championship (0.7); leaves “ball”, “baseball”, “championship” (1.0); all other nodes (0).]
2011.02.07 - SLIDE 30IS 240 – Spring 2011
Rule Evaluation Tree
[Rule evaluation tree, same structure and weights as before. Updated values: Event (0.63); Baseball_championship (0.7); Baseball (1.0), Championship (0.7); leaves “ball”, “baseball”, “championship” (1.0); all other nodes (0).]
2011.02.07 - SLIDE 31IS 240 – Spring 2011
Rule Evaluation Tree
[Rule evaluation tree, same structure and weights as before. Final values: World_Series (0.63); Event (0.63); Baseball_championship (0.7); Baseball (1.0), Championship (0.7); leaves “ball”, “baseball”, “championship” (1.0); all other nodes (0).]
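The propagation shown in the rule evaluation trees can be sketched in code. This is an illustration only, not RUBRIC itself: the `RULES` table and `evaluate` function are hypothetical, a weight of 1.0 is assumed wherever the rules give no explicit weight, and leaf text patterns are simplified to exact membership in a set of terms.

```python
# concept -> (operator, [(child, weight), ...]); quoted names are leaf text patterns.
RULES = {
    "World_Series": ("or", [("Team", 1.0), ("event", 1.0)]),
    "Team": ("or", [("St._Louis_Cardinals", 1.0), ("Milwaukee_Brewers", 1.0)]),
    "St._Louis_Cardinals": ("or", [('"Cardinals"', 0.7), ("Cardinals_full_name", 0.9)]),
    "Cardinals_full_name": ("and", [("Saint", 1.0), ('"Louis"', 1.0), ('"Cardinals"', 1.0)]),
    "Saint": ("or", [('"St."', 0.9), ('"Saint"', 1.0)]),
    "Milwaukee_Brewers": ("or", [('"Brewers"', 0.5), ('"Milwaukee Brewers"', 0.9)]),
    "event": ("or", [('"World Series"', 1.0), ("Baseball_championship", 0.9)]),
    "Baseball_championship": ("and", [("Baseball", 1.0), ("Championship", 1.0)]),
    "Baseball": ("or", [('"ball"', 0.5), ('"baseball"', 1.0)]),
    "Championship": ("or", [('"championship"', 0.7)]),
}


def evaluate(node, present):
    """Return the value of a concept for a document given its set of terms."""
    if node not in RULES:
        # Leaf pattern: 1.0 if the (unquoted) text occurs in the document.
        return 1.0 if node.strip('"') in present else 0.0
    op, children = RULES[node]
    # V(level n) = Cn * V(level n-1), then MAX for "or" and MIN for "and".
    values = [weight * evaluate(child, present) for child, weight in children]
    return max(values) if op == "or" else min(values)


doc_terms = {"ball", "baseball", "championship"}
for concept in ("Baseball", "Championship", "Baseball_championship",
                "event", "World_Series"):
    print(concept, round(evaluate(concept, doc_terms), 2))
```

Evaluating bottom-up in this way mirrors the step-by-step slides: leaves first, then each concept from its weighted children.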
2011.02.07 - SLIDE 33IS 240 – Spring 2011
Non-Boolean IR
• Need to measure some similarity between the query and the document
• The basic notion is that documents that are somehow similar to a query are likely to be relevant responses for that query
• We will revisit this notion and see how the Language Modelling approach to IR has taken it to a new level
2011.02.07 - SLIDE 34IS 240 – Spring 2011
Similarity Measures (Set-based)
Assuming that Q and D are the sets of terms associated with a Query and Document:

Simple matching (coordination level match): $|Q \cap D|$

Dice’s Coefficient: $\dfrac{2\,|Q \cap D|}{|Q| + |D|}$

Jaccard’s Coefficient: $\dfrac{|Q \cap D|}{|Q \cup D|}$

Cosine Coefficient: $\dfrac{|Q \cap D|}{|Q|^{1/2} \times |D|^{1/2}}$

Overlap Coefficient: $\dfrac{|Q \cap D|}{\min(|Q|, |D|)}$
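The five set-based measures translate directly into Python set operations. This is a minimal sketch; the function names and the example term sets are hypothetical, and returning 0.0 for empty sets is an assumption made here to avoid division by zero.

```python
import math


def simple_matching(q, d):
    return len(q & d)


def dice(q, d):
    total = len(q) + len(d)
    return 2 * len(q & d) / total if total else 0.0


def jaccard(q, d):
    either = len(q | d)
    return len(q & d) / either if either else 0.0


def cosine(q, d):
    if not q or not d:
        return 0.0
    return len(q & d) / (math.sqrt(len(q)) * math.sqrt(len(d)))


def overlap(q, d):
    smaller = min(len(q), len(d))
    return len(q & d) / smaller if smaller else 0.0


# Hypothetical query and document term sets.
Q = {"a", "b", "c"}
D = {"b", "c", "d", "e"}

for measure in (simple_matching, dice, jaccard, cosine, overlap):
    print(measure.__name__, measure(Q, D))
```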
2011.02.07 - SLIDE 35IS 240 – Spring 2011
Document Vectors
• Documents are represented as “bags of words”
• Represented as vectors when used computationally
  – A vector is like an array of floating point numbers
  – Has direction and magnitude
  – Each vector holds a place for every term in the collection
  – Therefore, most vectors are sparse
2011.02.07 - SLIDE 36IS 240 – Spring 2011
Vector Space Model
• Documents are represented as vectors in “term space”
  – Terms are usually stems
  – Documents represented by binary or weighted vectors of terms
• Queries represented the same as documents
• Query and Document weights are based on length and direction of their vector
• A vector distance measure between the query and documents is used to rank retrieved documents
2011.02.07 - SLIDE 37IS 240 – Spring 2011
Vector Representation
• Documents and Queries are represented as vectors
• Position 1 corresponds to term 1, position 2 to term 2, position t to term t
• The weight of the term is stored in each position
$D_i = (w_{d_{i1}}, w_{d_{i2}}, \ldots, w_{d_{it}})$
$Q = (w_{q1}, w_{q2}, \ldots, w_{qt})$
$w = 0$ if a term is absent
2011.02.07 - SLIDE 38IS 240 – Spring 2011
Document Vectors + Frequency
ID nova galaxy heat h'wood film role diet fur
A 10 5 3
B 5 10
C 10 8 7
D 9 10 5
E 10 10
F 9 10
G 5 7 9
H 6 10 2 8
I 7 5 1 3

“Nova” occurs 10 times in text A
“Galaxy” occurs 5 times in text A
“Heat” occurs 3 times in text A
(Blank means 0 occurrences.)
2011.02.07 - SLIDE 39IS 240 – Spring 2011
Document Vectors + Frequency
ID nova galaxy heat h'wood film role diet fur
A 10 5 3
B 5 10
C 10 8 7
D 9 10 5
E 10 10
F 9 10
G 5 7 9
H 6 10 2 8
I 7 5 1 3

“Hollywood” occurs 7 times in text I
“Film” occurs 5 times in text I
“Diet” occurs 1 time in text I
“Fur” occurs 3 times in text I
2011.02.07 - SLIDE 41IS 240 – Spring 2011
We Can Plot the Vectors
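[Figure: documents plotted as vectors in a two-dimensional term space with axes “Star” and “Diet”; the plotted documents are labelled “Doc about astronomy”, “Doc about movie stars” and “Doc about mammal behavior”]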
2011.02.07 - SLIDE 42IS 240 – Spring 2011
Documents in 3D Space
Primary assumption of the Vector Space Model: Documents that are “close together” in space are similar in meaning
2011.02.07 - SLIDE 43IS 240 – Spring 2011
Vector Space Documents and Queries
docs  t1  t2  t3  RSV=Q.Di
D1    1   0   1   4
D2    1   0   0   1
D3    0   1   1   5
D4    1   0   0   1
D5    1   1   1   6
D6    1   1   0   3
D7    0   1   0   2
D8    0   1   0   2
D9    0   0   1   3
D10   0   1   1   5
D11   1   0   1   4
Q     1   2   3
      q1  q2  q3

[Figure: D1 to D11 plotted in the three-dimensional space with axes t1, t2 and t3]

Boolean term combinations
Q is a query – also represented as a vector
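The RSV column is the inner product of the query vector with each document vector. This is a minimal sketch, assuming D11 is (1, 0, 1), the vector consistent with its listed RSV of 4 and with the binary-weights table later in the lecture; the function name `rsv` is hypothetical.

```python
Q = (1, 2, 3)  # query weights q1, q2, q3

DOCS = {
    "D1": (1, 0, 1), "D2": (1, 0, 0), "D3": (0, 1, 1), "D4": (1, 0, 0),
    "D5": (1, 1, 1), "D6": (1, 1, 0), "D7": (0, 1, 0), "D8": (0, 1, 0),
    "D9": (0, 0, 1), "D10": (0, 1, 1), "D11": (1, 0, 1),  # D11 assumed, see above
}


def rsv(query, doc):
    """Retrieval status value: inner product Q . Di."""
    return sum(q * d for q, d in zip(query, doc))


# Rank documents by decreasing RSV.
ranked = sorted(DOCS, key=lambda name: rsv(Q, DOCS[name]), reverse=True)
for name in ranked:
    print(name, rsv(Q, DOCS[name]))
```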
2011.02.07 - SLIDE 44IS 240 – Spring 2011
Documents in Vector Space
[Figure: documents D1 to D11 plotted as points in the three-dimensional space with axes t1, t2 and t3]
2011.02.07 - SLIDE 45IS 240 – Spring 2011
Document Space has High Dimensionality
• What happens beyond 2 or 3 dimensions?
• Similarity still has to do with how many tokens are shared in common.
• More terms -> harder to understand which subsets of words are shared among similar documents.
• We will look in detail at ranking methods
• Approaches to handling high dimensionality: Clustering and LSI (later)
2011.02.07 - SLIDE 47IS 240 – Spring 2011
Assigning Weights to Terms
• Binary Weights
• Raw term frequency
• tf*idf
  – Recall the Zipf distribution
  – Want to weight terms highly if they are
    • Frequent in relevant documents … BUT
    • Infrequent in the collection as a whole
• Automatically derived thesaurus terms
2011.02.07 - SLIDE 48IS 240 – Spring 2011
Binary Weights
• Only the presence (1) or absence (0) of a term is included in the vector
docs  t1  t2  t3
D1    1   0   1
D2    1   0   0
D3    0   1   1
D4    1   0   0
D5    1   1   1
D6    1   1   0
D7    0   1   0
D8    0   1   0
D9    0   0   1
D10   0   1   1
D11   1   0   1
2011.02.07 - SLIDE 49IS 240 – Spring 2011
Raw Term Weights
• The frequency of occurrence for the term in each document is included in the vector
docs  t1  t2  t3
D1    2   0   3
D2    1   0   0
D3    0   4   7
D4    3   0   0
D5    1   6   3
D6    3   5   0
D7    0   8   0
D8    0   10  0
D9    0   0   1
D10   0   3   5
D11   4   0   1
2011.02.07 - SLIDE 50IS 240 – Spring 2011
Assigning Weights
• tf*idf measure:
  – Term frequency (tf)
  – Inverse document frequency (idf)
• A way to deal with some of the problems of the Zipf distribution
• Goal: Assign a tf*idf weight to each term in each document
2011.02.07 - SLIDE 51IS 240 – Spring 2011
Simple tf*idf
$$w_{ik} = tf_{ik} \ast \log(N / n_k)$$

$T_k$ = term $k$ in document $D_i$
$tf_{ik}$ = frequency of term $T_k$ in document $D_i$
$idf_k$ = inverse document frequency of term $T_k$ in collection $C$
$N$ = total number of documents in the collection $C$
$n_k$ = the number of documents in $C$ that contain $T_k$

$$idf_k = \log\left(\frac{N}{n_k}\right)$$
2011.02.07 - SLIDE 52IS 240 – Spring 2011
Inverse Document Frequency
• IDF provides high values for rare words and low values for common words
For a collection of 10000 documents (N = 10000):

$\log\left(\frac{10000}{10000}\right) = 0$
$\log\left(\frac{10000}{5000}\right) = 0.301$
$\log\left(\frac{10000}{20}\right) = 2.699$
$\log\left(\frac{10000}{1}\right) = 4$
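The IDF values for the 10000-document collection can be checked numerically. This is a minimal sketch, assuming base-10 logarithms (the base that matches the values 0, 0.301 and 4 on the slide); the function name `idf` is hypothetical.

```python
import math


def idf(total_docs, docs_with_term):
    """Inverse document frequency: log10(N / n_k)."""
    return math.log10(total_docs / docs_with_term)


N = 10000
for n_k in (10000, 5000, 20, 1):
    print(n_k, idf(N, n_k))
```

The rarer the term (smaller n_k), the larger its IDF.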
2011.02.07 - SLIDE 53IS 240 – Spring 2011
Word Frequency vs. Resolving Power
The most frequent words are not the most descriptive.
(from van Rijsbergen 79)
2011.02.07 - SLIDE 54IS 240 – Spring 2011
Weighting schemes
• We have seen something of
  – Binary
  – Raw term weights
  – TF*IDF
• There are many other possibilities
  – IDF alone
  – Normalized term frequency
  – etc.
2011.02.07 - SLIDE 55IS 240 – Spring 2011
tf x idf normalization
• Normalize the term weights (so longer documents are not unfairly given more weight)
  – normalize usually means force all values to fall within a certain range, usually between 0 and 1, inclusive.

$$w_{ik} = \frac{tf_{ik}\,\log(N/n_k)}{\sqrt{\sum_{k=1}^{t} (tf_{ik})^2\,[\log(N/n_k)]^2}}$$
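The normalized tf x idf weight can be sketched as follows, using the raw term frequencies from the "Raw Term Weights" table as input. This is an illustration under assumptions: base-10 logarithms, n_k counted from the table itself, and a hypothetical function name `tfidf_normalized`.

```python
import math

# Raw term frequencies (t1, t2, t3) from the "Raw Term Weights" slide.
RAW_TF = {
    "D1": (2, 0, 3), "D2": (1, 0, 0), "D3": (0, 4, 7), "D4": (3, 0, 0),
    "D5": (1, 6, 3), "D6": (3, 5, 0), "D7": (0, 8, 0), "D8": (0, 10, 0),
    "D9": (0, 0, 1), "D10": (0, 3, 5), "D11": (4, 0, 1),
}


def tfidf_normalized(tf_table):
    """Length-normalized tf*idf weights for every document in tf_table."""
    n_docs = len(tf_table)
    n_terms = len(next(iter(tf_table.values())))
    # n_k: number of documents that contain term k.
    n = [sum(1 for tf in tf_table.values() if tf[k] > 0) for k in range(n_terms)]
    idf = [math.log10(n_docs / n_k) if n_k else 0.0 for n_k in n]
    weights = {}
    for doc, tf in tf_table.items():
        raw = [tf[k] * idf[k] for k in range(n_terms)]
        length = math.sqrt(sum(w * w for w in raw))
        weights[doc] = [w / length if length else 0.0 for w in raw]
    return weights


WEIGHTS = tfidf_normalized(RAW_TF)
for doc, w in WEIGHTS.items():
    print(doc, [round(x, 3) for x in w])
```

After normalization every non-empty document vector has unit length, so long documents no longer dominate simply because their raw counts are larger.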
2011.02.07 - SLIDE 56IS 240 – Spring 2011
Vector space similarity
• Use the weights to compare the documents
Now, the similarity of two documents is:

$$sim(D_i, D_j) = \sum_{k=1}^{t} w_{ik} \ast w_{jk}$$

This is also called the cosine, or normalized inner product. (Normalization was done when weighting the terms.)
2011.02.07 - SLIDE 57IS 240 – Spring 2011
Vector Space Similarity Measure
• combine tf x idf into a measure
$D_i = (w_{d_{i1}}, w_{d_{i2}}, \ldots, w_{d_{it}})$
$Q = (w_{q1}, w_{q2}, \ldots, w_{qt})$
$w = 0$ if a term is absent

if term weights are normalized:

$$sim(Q, D_i) = \sum_{j=1}^{t} w_{qj} \ast w_{d_{ij}}$$

otherwise we could normalize in the similarity comparison:

$$sim(Q, D_i) = \frac{\sum_{j=1}^{t} w_{qj} \ast w_{d_{ij}}}{\sqrt{\sum_{j=1}^{t} (w_{qj})^2 \ast \sum_{j=1}^{t} (w_{d_{ij}})^2}}$$
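The second form, where normalization happens inside the similarity comparison, can be sketched as below. This is a minimal illustration; the function name `cosine_similarity` is hypothetical, and returning 0.0 for an all-zero vector is an assumption made here to avoid division by zero.

```python
import math


def cosine_similarity(query, doc):
    """Inner product of two weight vectors divided by the product of their lengths."""
    dot = sum(q * d for q, d in zip(query, doc))
    q_len = math.sqrt(sum(q * q for q in query))
    d_len = math.sqrt(sum(d * d for d in doc))
    if q_len == 0 or d_len == 0:
        return 0.0
    return dot / (q_len * d_len)


print(cosine_similarity((1, 2, 3), (1, 0, 1)))
print(cosine_similarity((1, 2, 3), (2, 4, 6)))
```

Two vectors that point in the same direction score 1 regardless of their lengths, which is why the measure is insensitive to document length.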
2011.02.07 - SLIDE 59IS 240 – Spring 2011
Term Weights in SMART
• SMART is an experimental IR system developed by Gerard Salton (and continued by Chris Buckley) at Cornell.
• Designed for laboratory experiments in IR
  – Easy to mix and match different weighting methods
  – Really terrible user interface
  – Intended for use by code hackers
2011.02.07 - SLIDE 60IS 240 – Spring 2011
Term Weights in SMART
• In SMART weights are decomposed into three factors:
$$w_{kd} = \frac{freq_{kd} \ast collect_k}{norm}$$
2011.02.07 - SLIDE 61IS 240 – Spring 2011
SMART Freq Components
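$$freq_{kd} = \begin{cases}
\{0, 1\} & \text{Binary} \\[4pt]
\dfrac{freq_{kd}}{\max(freq_{kd})} & \text{maxnorm} \\[8pt]
\dfrac{1}{2} + \dfrac{freq_{kd}}{2\max(freq_{kd})} & \text{augmented} \\[8pt]
\ln(freq_{kd}) + 1 & \text{log}
\end{cases}$$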
2011.02.07 - SLIDE 62IS 240 – Spring 2011
Collection Weighting in SMART
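$$collect_k = \begin{cases}
\log\left(\dfrac{NDoc}{Doc_k}\right) & \text{Inverse} \\[8pt]
\left(\log\dfrac{NDoc}{Doc_k}\right)^2 & \text{squared} \\[8pt]
\log\left(\dfrac{NDoc - Doc_k}{Doc_k}\right) & \text{probabilistic} \\[8pt]
\dfrac{1}{Doc_k} & \text{frequency}
\end{cases}$$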
2011.02.07 - SLIDE 63IS 240 – Spring 2011
Term Normalization in SMART
$$norm = \begin{cases}
\sum_{vector} w_j & \text{sum} \\[6pt]
\sqrt{\sum_{vector} w_j^2} & \text{cosine} \\[6pt]
\sum_{vector} w_j^4 & \text{fourth} \\[6pt]
\max_{vector}(w_j) & \text{max}
\end{cases}$$
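The three SMART factors can be mixed and matched as in the sketch below. This is an illustration, not SMART's actual code: the table and function names are hypothetical, natural logarithms are assumed for the collection component (the slides do not state the base), and terms with zero frequency are assumed to keep a weight of 0.

```python
import math

# Frequency component: f is freq_kd, fmax is the largest freq in the document.
FREQ = {
    "binary": lambda f, fmax: 1.0,
    "maxnorm": lambda f, fmax: f / fmax,
    "augmented": lambda f, fmax: 0.5 + 0.5 * f / fmax,
    "log": lambda f, fmax: math.log(f) + 1.0,
}

# Collection component: ndoc is NDoc, doc_k is the number of docs containing term k.
COLLECT = {
    "inverse": lambda ndoc, doc_k: math.log(ndoc / doc_k),
    "squared": lambda ndoc, doc_k: math.log(ndoc / doc_k) ** 2,
    "probabilistic": lambda ndoc, doc_k: math.log((ndoc - doc_k) / doc_k),
    "frequency": lambda ndoc, doc_k: 1.0 / doc_k,
}

# Normalization component, computed over the whole document vector.
NORM = {
    "sum": lambda ws: sum(ws),
    "cosine": lambda ws: math.sqrt(sum(w * w for w in ws)),
    "fourth": lambda ws: sum(w ** 4 for w in ws),
    "max": lambda ws: max(ws),
}


def smart_weights(freqs, doc_counts, ndoc,
                  freq="augmented", collect="inverse", norm="cosine"):
    """w_kd = freq_kd * collect_k / norm for one document vector."""
    fmax = max(freqs)
    raw = [
        FREQ[freq](f, fmax) * COLLECT[collect](ndoc, doc_k) if f > 0 else 0.0
        for f, doc_k in zip(freqs, doc_counts)
    ]
    denom = NORM[norm](raw)
    return [w / denom if denom else 0.0 for w in raw]


# Hypothetical document: term frequencies 2, 0, 3; each term occurs in 6 of 11 docs.
print(smart_weights([2, 0, 3], [6, 6, 6], 11,
                    freq="maxnorm", collect="inverse", norm="cosine"))
```

Keeping each factor in its own lookup table makes it easy to swap one choice without touching the others.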
2011.02.07 - SLIDE 64IS 240 – Spring 2011
Lucene Algorithm
• The open-source Lucene system is a vector based system that differs from SMART-like systems in the ways the TF*IDF measures are normalized
2011.02.07 - SLIDE 65IS 240 – Spring 2011
Lucene
• The basic Lucene algorithm is:

$$Score(q, d) = \sum_{t} \left( \frac{tf_{q,t} \cdot idf_t}{norm_q} \cdot \frac{tf_{d,t} \cdot idf_t}{norm_{d,t}} \cdot boost_t \right) \cdot overlap(q, d)$$

• Where $\dfrac{tf_{q,t} \cdot idf_t}{norm_q}$ is the length normalized query
  – and $norm_{d,t}$ is the term normalization (square root of the number of tokens in the same document field as t)
  – $overlap(q,d)$ is the proportion of query terms matched in the document
  – $boost_t$ is a user specified term weight enhancement
2011.02.07 - SLIDE 66IS 240 – Spring 2011
How To Process a Vector Query
• Assume that the database contains an inverted file like the one we discussed earlier…
  – Why an inverted file?
  – Why not a REAL vector file?
• What information should be stored about each document/term pair?
  – As we have seen SMART gives you choices about this…
2011.02.07 - SLIDE 67IS 240 – Spring 2011
Simple Example System
• Collection frequency is stored in the dictionary
• Raw term frequency is stored in the inverted file postings list
• Formula for term ranking
$$sim(Q, D_i) = \sum_{k=1}^{M} w_{qk} \cdot w_{ik}$$

$$w_k = tf_k \cdot \log\left(\frac{N}{n_k}\right)$$
2011.02.07 - SLIDE 68IS 240 – Spring 2011
Processing a Query
• For each term in the query
  – Count number of times the term occurs – this is the tf for the query term
  – Find the term in the inverted dictionary file and get:
    • nk : the number of documents in the collection with this term
    • Loc : the location of the postings list in the inverted file
    • Calculate Query Weight: wqk
    • Retrieve nk entries starting at Loc in the postings file
2011.02.07 - SLIDE 69IS 240 – Spring 2011
Processing a Query
• Alternative strategies…
  – First retrieve all of the dictionary entries before getting any postings information
    • Why?
  – Just process each term in sequence
• How can we tell how many results there will be?
  – It is possible to put a limitation on the number of items returned
    • How might this be done?
2011.02.07 - SLIDE 70IS 240 – Spring 2011
Processing a Query
• Like Hashed Boolean OR:
  – Put each document ID from each postings list into hash table
    • If match increment counter (optional)
  – If first doc, set a WeightSUM variable to 0
    • Calculate Document weight wik for the current term
    • Multiply Query weight and Document weight and add it to WeightSUM
• Scan hash table contents and add to new list – including document ID and WeightSUM
• Sort by WeightSUM and present in sorted order
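The query-processing steps above can be sketched with a Python dict standing in for the hash table. This is a minimal illustration under assumptions: the toy inverted file, its term frequencies and the names `INVERTED`, `N_DOCS` and `process_query` are hypothetical, weights use tf * log10(N / n_k) as in the simple example system, and the optional `limit` parameter is one possible way to cap the number of items returned.

```python
import math
from collections import Counter, defaultdict

# Hypothetical inverted file: term -> postings list of (doc_id, raw tf).
INVERTED = {
    "nova": [("A", 10), ("B", 5)],
    "galaxy": [("A", 5), ("B", 10)],
    "film": [("C", 8), ("D", 10)],
}
N_DOCS = 4  # total number of documents in this toy collection


def process_query(query_terms, inverted, n_docs, limit=None):
    """Rank documents by the sum over query terms of w_qk * w_ik."""
    query_tf = Counter(query_terms)       # tf of each term in the query
    weight_sum = defaultdict(float)       # hash table: doc_id -> WeightSUM
    for term, tf_q in query_tf.items():
        postings = inverted.get(term)
        if not postings:
            continue                      # term not in the dictionary
        n_k = len(postings)               # documents containing the term
        idf = math.log10(n_docs / n_k)
        w_qk = tf_q * idf                 # query weight
        for doc_id, tf_d in postings:
            w_ik = tf_d * idf             # document weight for this term
            weight_sum[doc_id] += w_qk * w_ik
    ranked = sorted(weight_sum.items(), key=lambda item: item[1], reverse=True)
    return ranked if limit is None else ranked[:limit]


for doc_id, score in process_query(["nova", "galaxy", "nova"], INVERTED, N_DOCS):
    print(doc_id, round(score, 4))
```

Only documents that appear in at least one postings list ever enter the hash table, which is the main reason for scoring from an inverted file rather than from full document vectors.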