Gertjan van Noord (2014), Zoekmachines, Lecture 4
TRANSCRIPT
Chapter 6 Overview
Scoring
• Parametric and zone indexing
• Weighted zone scoring (skip 6.1.2 and 6.1.3)
• Term frequency and weighting
• Vector model, queries as vectors
• Computing scores (skip 6.4)
Formulas and symbols
• Σ : sum; ∏ : product
• √ : (square) root: √x = y means y² = x
  √1 = 1, √4 = 2, √10 = 3.16…, √100 = 10
• log : logarithm (base 10): log x = y means 10^y = x
  log 1 = 0, log 4 = 0.6…, log 10 = 1, log 100 = 2
  (logarithms increase slowly)
• ∑_{i=1}^{n} a[i] and ∏_{i=1}^{n} a[i] : sum / multiply the array elements a[1] … a[n]
• |x| stands for the number of elements of x
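As a quick illustration, these symbols map directly onto Python's standard library (a minimal sketch; the array `a` is just example data):

```python
import math

# The symbols from the slide, written as Python expressions.
a = [1, 4, 10, 100]

total = sum(a)           # Σ a[i]  -> 1 + 4 + 10 + 100 = 115
product = math.prod(a)   # ∏ a[i]  -> 1 * 4 * 10 * 100 = 4000

# √x = y means y**2 == x
roots = [math.sqrt(x) for x in a]    # [1.0, 2.0, 3.16..., 10.0]

# log x = y (base 10) means 10**y == x; the values grow slowly
logs = [math.log10(x) for x in a]    # [0.0, 0.6..., 1.0, 2.0]

# |a| : number of elements
size = len(a)            # 4
```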
Parametric indexes
To a digital document, metadata can be attached
fields: format, date, author, title, keywords, …
Dublin Core: a set of metadata fields for digital libraries

Extra parametric indexes are built for these fields
each field gets its own - limited - dictionary
The user interface should accommodate querying of metadata next to text queries
The system must combine the query parts for retrieval
Zone indexes
In the (free) text of a document itself, different zones can be recognized:
• Author
• Title
• Date of publication
• Abstract
• Keywords
• Body
• …
Difference between parametric index and zone index
Zone indexes
– If documents have a common outline, zones can be separated in the index, in 2 ways:
1. different index terms: william.author, william.body
2. different postings (zone recorded per doc occurrence):
   william → 1.author → 2.author, 2.body → …
Again: user interface and retrieval must be adapted
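The two encodings can be sketched as plain dictionaries (a minimal sketch with an invented mini-collection; the structures, not the data, are the point):

```python
# Way 1: zone in the dictionary -- one index term per (word, zone) pair.
index_in_dictionary = {
    "william.author": [1, 2],
    "william.body": [2],
}

# Way 2: zone in the postings -- one term, postings carry (doc, zone) pairs.
index_in_postings = {
    "william": [(1, "author"), (2, "author"), (2, "body")],
}

# Both encode the same information, e.g. docs with "william" in the author zone:
docs_a = index_in_dictionary["william.author"]
docs_b = [d for (d, z) in index_in_postings["william"] if z == "author"]
```

Way 1 blows up the dictionary (one entry per word-zone pair); way 2 keeps one dictionary entry per word but makes each posting larger.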
Weighted zone scoring
Zones can be used by the system for a ranked Boolean retrieval (without zone queries)
With a query word in the title the doc is more relevant
Each zone gets a partial weight (sum=1)
Score of each zone of a doc for a query is Boolean score (0/1) * zone weight
The score of the doc for the query is the sum of its zone scores (0 <= docscore <= 1), so docs can be ordered
Example
• Query: pie AND cream
• 1: title, 2: abstract; g1 = 0.6, g2 = 0.4, s = AND
• Doc score:
• Document D1              g     s    sc
  1. Title     apple pie   0.6 * 0  = 0
  2. Abstract  pie cream   0.4 * 1  = 0.4
  Total score (D1, q)                 0.4
• What if one word in title, other in abstract?
score(d, q) = ∑_{i=1}^{ℓ} g_i · s_i
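The example above can be sketched in a few lines (a minimal sketch, using the slide's query "pie AND cream" and zone weights; the helper name and document data are illustrative):

```python
def weighted_zone_score(doc_zones, query_terms, weights):
    """score(d, q) = sum over zones i of g_i * s_i,
    where s_i is 1 if the zone contains ALL query terms (Boolean AND)."""
    score = 0.0
    for zone, text in doc_zones.items():
        tokens = set(text.lower().split())
        s = 1 if all(t in tokens for t in query_terms) else 0
        score += weights[zone] * s
    return score

d1 = {"title": "apple pie", "abstract": "pie cream"}
g = {"title": 0.6, "abstract": 0.4}

score = weighted_zone_score(d1, ["pie", "cream"], g)  # 0.6*0 + 0.4*1 = 0.4
```

With one query word in the title and the other in the abstract, both zones fail the AND test, so the score is 0 under this scheme.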
Summary simple boolean model
• only store presence of a term in a document
• conjunctive query
  select docs that contain all query terms
• disjunctive query
  select docs that contain at least one of the query terms
• arbitrary boolean query
  select docs in which the query term distribution satisfies the query, using the operators AND, OR and NOT
• binary decision: documents are selected or not, no ranking
How can we improve the system? We need methods to rank the results
containing some or all query terms
Is there more than presence/absence?
Is each term equally important?
Vector model
• terms in a document (and a query) receive a score
• scores are combined to calculate for each document the similarity with the query
• using the similarity scores, the result set of documents containing one or more query terms can be ranked
How can we calculate a score for a term?For a term in a document? A term in general?
Documents as vectors
Documents presented as a vector of weights for each term of the dictionary

V(Doc1) = (8, 2, 2, 10, 2, ...)
only a few of the dictionary terms shown below

Using term frequencies as weights, which document scores best for the query: food of ape?

        ape  child  food  of  panther
Doc1      8      2     2  10        2
Doc2      1      5     9  20        0
Docn     ..     ..    ..  ..       ..
Query     1      0     1   1        0
Advanced Scores for Terms in Documents
• tf_t,d : term frequency of t in d: how many tokens of term t in document d
• df_t : document frequency of t: how many documents with term t?
• df_t / N : relative document frequency of t: how many docs with term t / total number of docs (N)?

BUT: high document frequency => important term?

• idf_t : inverse document frequency of t: N/df_t; if df_t is higher, the inverse is lower
• tf_t,d · idf_t : score for a term in a document
Score for a query - doc combination

score(q, d) = ∑_{t∈q} tf_t,d × idf_t

q       query
d       document
t ∈ q   each term t that is an element of query q
tf_t,d  term frequency of term t in doc d
idf_t   inverse document frequency of t: log N/df_t
N       number of documents in collection
Example
assume: idf_ape = 5, idf_food = 2, idf_of = 0.001

score("food of ape", Doc1) =
score("food of ape", Doc2) =

        ape  child  food  of  panther
Doc1      8      2     2  10        2
Doc2      1      5     9  20        0
Docn     ..     ..    ..  ..       ..
Query     1      0     1   1        0
Example
score("food of ape", Doc1) = 8*5 + 2*2 + 10*0.001 = 44.01 ≈ 44

score("food of ape", Doc2) = 1*5 + 9*2 + 20*0.001 = 23.02 ≈ 23

        ape  child  food  of  panther
Doc1      8      2     2  10        2
Doc2      1      5     9  20        0
Docn     ..     ..    ..  ..       ..
Query     1      0     1   1        0

score(q, d) = ∑_{t∈q} tf_t,d × idf_t
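The example can be checked with a short script (a minimal sketch; the tf table and idf values are the ones given on the slide, everything else is illustrative):

```python
# score(q, d) = sum over t in q of tf_{t,d} * idf_t
tf = {
    "Doc1": {"ape": 8, "child": 2, "food": 2, "of": 10, "panther": 2},
    "Doc2": {"ape": 1, "child": 5, "food": 9, "of": 20, "panther": 0},
}
idf = {"ape": 5, "food": 2, "of": 0.001}

def score(query_terms, doc):
    # Terms missing from the doc or without an idf value contribute 0.
    return sum(tf[doc].get(t, 0) * idf.get(t, 0) for t in query_terms)

q = ["food", "of", "ape"]
s1 = score(q, "Doc1")   # 2*2 + 10*0.001 + 8*5 = 44.01
s2 = score(q, "Doc2")   # 9*2 + 20*0.001 + 1*5 = 23.02
```

Note how the tiny idf of "of" makes its large term frequencies (10 and 20) almost irrelevant to the ranking.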
What about the query terms?
The query vector can also have weights, then we multiply the 2 weights for each term:
Often not the same weighting, so in general
score(q, d) = ∑_{t∈q} (tf_t,q × idf_t) · (tf_t,d × idf_t)

score(q, d) = ∑_{t∈q} w_t,q · w_t,d
• Problem: current approach does not take into account document length
• Solution: use vector term weights instead of term frequencies
• For vector length: Euclidean length
√(∑_{i=1}^{n} V_i²)

Length of a vector V with n elements = the square root of (the sum of the squares of all the elements)
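The Euclidean length and the resulting length normalization can be sketched as follows (a minimal sketch; the vector is the example Doc1 tf vector from earlier slides):

```python
import math

def euclidean_length(v):
    """|V| = square root of the sum of the squares of all elements."""
    return math.sqrt(sum(x * x for x in v))

def normalize(v):
    """Divide each component by the L2 length -> unit vector."""
    length = euclidean_length(v)
    return [x / length for x in v]

doc1 = [8, 2, 2, 10, 2]
length = euclidean_length(doc1)   # sqrt(64 + 4 + 4 + 100 + 4) = sqrt(176)
unit = normalize(doc1)            # the normalized vector has length 1
```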
Example: compute normalized frequencies for Figure 6.9
Next couple of sheets from:http://www.stanford.edu/class/cs276/
Documents as vectors
So we have a |V|-dimensional vector space
Terms are axes of the space
Documents are points or vectors in this space
Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine
These are very sparse vectors - most entries are zero.
Sec. 6.3
Queries as vectors
Key idea 1: Do the same for queries: represent them as vectors in the space
Key idea 2: Rank documents according to their proximity to the query in this space
proximity = similarity of vectors
proximity ≈ inverse of distance
Recall: We do this because we want to get away from the you’re-either-in-or-out Boolean model.
Instead: rank more relevant documents higher than less relevant documents
Formalizing vector space proximity
First cut: distance between two points
( = distance between the end points of the two vectors)Euclidean distance?
Euclidean distance is a bad idea . . .
. . . because Euclidean distance is large for vectors of different lengths.
Why distance is a bad idea
The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.
Use angle instead of distance
Thought experiment: take a document d and append it to itself. Call this document d′.
“Semantically” d and d′ have the same content
The Euclidean distance between the two documents can be quite large
The angle between the two documents is 0, corresponding to maximal similarity.
Key idea: Rank documents according to angle with query.
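The thought experiment is easy to verify numerically (a minimal sketch; appending d to itself doubles every term count, and the vector is made-up example data):

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

d = [8, 2, 2, 10, 2]
d_prime = [2 * x for x in d]   # d appended to itself: every count doubles

dist = euclidean_distance(d, d_prime)  # large: equal to the length of d itself
sim = cosine(d, d_prime)               # 1.0: the angle between them is 0
```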
Length normalization
A vector can be (length-) normalized by dividing each of its components by its length – for this we use the L2 norm:
Dividing a vector by its L2 norm makes it a unit (length) vector (on surface of unit hypersphere)
Effect on the two documents d and d′ (d appended to itself) from earlier slide: they have identical vectors after length-normalization.
Long and short documents now have comparable weights
‖x⃗‖₂ = √(∑ x_i²)
cosine(query,document)
cos(q⃗, d⃗) = (q⃗ · d⃗) / (|q⃗| |d⃗|)
           = (q⃗ / |q⃗|) · (d⃗ / |d⃗|)
           = ∑ q_i d_i / (√(∑ q_i²) · √(∑ d_i²))

(q⃗ · d⃗ is the dot product; q⃗/|q⃗| and d⃗/|d⃗| are unit vectors)
qi is the tf-idf weight of term i in the querydi is the tf-idf weight of term i in the document
cos(q,d) is the cosine similarity of q and d … or,equivalently, the cosine of the angle between q and d.
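The definition can be sketched directly (a minimal sketch; the vectors are the raw tf vectors for the query "food of ape" and Doc1 from the earlier table, rather than tf-idf weights):

```python
import math

def cosine_similarity(q, d):
    """cos(q, d) = sum(q_i * d_i) / (sqrt(sum q_i^2) * sqrt(sum d_i^2))."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d)

query = [1, 0, 1, 1, 0]   # "food of ape" over (ape, child, food, of, panther)
doc1 = [8, 2, 2, 10, 2]

sim = cosine_similarity(query, doc1)

# For length-normalized vectors the same value is just the dot product:
nq = [x / math.sqrt(sum(v * v for v in query)) for x in query]
nd = [x / math.sqrt(sum(v * v for v in doc1)) for x in doc1]
sim2 = sum(a * b for a, b in zip(nq, nd))
```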
Cosine for length-normalized vectors
For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):
for q, d length-normalized.
cos(q⃗, d⃗) = q⃗ · d⃗ = ∑ q_i d_i