cs 430: information discovery

1

CS 430: Information Discovery

Lecture 4

Ranking

2

Course Administration

• The slides for Lecture 3 have been reposted with slightly revised notation.

• The reading for Discussion Class 2 requires a computer connected to the network with a Cornell IP address.

• Teaching assistants do not have office hours. If your query cannot be addressed by email, ask to meet with them or come to my office hours.

• Assignment 1 is an individual assignment. Discuss the concepts and the choice of methods with your colleagues, but the actual programs and report much be individual work.

3

Choice of Weights

ant bee cat dog eel fox gnu hog

q ? ? d1 ? ? d2 ? ? ? ? d3 ? ? ? ? ?

queryq ant dogdocument text termsd1 ant ant bee ant bee

d2 dog bee dog hog dog ant dog ant bee dog hog

d3 cat gnu dog eel fox cat dog eel fox gnu

What weights lead to the best information retrieval?

4

Methods for Selecting Weights

Empirical

Test a large number of possible weighting schemes with actual data. (This lecture, based on work of Salton, et al.)

Model based

Develop a mathematical model of word distribution and derive weighting scheme theoretically. (Probabilistic model of information retrieval.)

5

Weighting1. Term Frequency

Suppose term j appears fij times in document i. What weighting should be given to a term j?

Term Frequency: Concept

A term that appears many times within a document is likely to be more important than a term that appears only once.

6

Term Frequency: Free-text Document

Length of document

Simple method (as illustrated in Lecture 3) is to use fij as the term frequency.

...but, in free-text documents, terms are likely to appear more often in long documents. Therefore fij should be scaled by some variable related to document length.

i

7

Term Frequency: Free-text Document

Standard method for free-text documents

Scale fij relative to the frequency of other terms in the document. This partially corrects for variations in the length of the documents.

Let mi = max (fij) i.e., mi is the maximum frequency of any term in document i

Term frequency (tf):

tfij = fij / mi when fij > 0

Note: There is no special justification for taking this form of term frequency except that it works well in practice and is easy to calculate.

i

8

Weighting2. Inverse Document Frequency

Suppose term j appears fij times in document i. What weighting should be given to a term j?

Inverse Document Frequency: Concept

A term that occurs in a few documents is likely to be a better discriminator that a term that appears in most or all documents.

9

Inverse Document Frequency

Suppose there are n documents and that the number of documents in which term j occurs is nj.

A possible method might be to use n/nj as the inverse document frequency.

Standard method

The simple method over-emphasizes small differences. Therefore use a logarithm.

Inverse document frequency (idf):

idfj = log2 (n/nj) + 1 nj > 0

Note: There is no special justification for taking this form of inverse document frequency except that it works well in practice and is easy to calculate.

10

Example of Inverse Document Frequency

Example

n = 1,000 documents

term j nj idfj

A 100 4.32 B 500 2.00 C 900 1.13 D 1,000 1.00

From: Salton and McGill

11

Full Weighting: Standard Form of tf.idf

Practical experience has demonstrated that weights of the following form perform well in a wide variety of circumstances:

(weight of term j in document i)

= (term frequency) * (inverse document frequency)

The standard tf.idf weighting scheme, for free text documents, is:

tij = tfij * idfj

= (fij / mi) * (log2 (n/nj) + 1) when nj > 0

12

Structured Text

Structured text

Structured texts, e.g., queries or catalog records, have different distribution of terms from free-text. A modified expression for the term frequency is:

tfij = K + (1 - K)*fij / mi when fij > 0

K is a parameter between 0 and 1 that can be tuned for a specific collection.

Query

To weigh terms in the query, Salton and Buckley recommend K equal to 0.5.

i

13

Similarity

The similarity between query q and document i is given by:

tqktik

|dq| |di|

Where dq and di are the corresponding weighted term vectors, with components in the k dimension (corresponding to term k) given by:

tqk = (0.5 + 0.5*fqk / mq)*(log2 (n/nk) + 1) when fqk > 0

tik = (fik / mi) * (log2 (n/nk) + 1) when fik > 0

k=1

n

cos(dq, di) =

14

Boolean Queries

Boolean query: two or more search terms, related by logical operators, e.g.,

and or not

Examples:

abacus and actor

abacus or actor

(abacus and actor) or (abacus and atoll)

not actor

15

Boolean Diagram

A B

A and B

A or B

not (A or B)

16

Adjacent and Near Operators

abacus adj actor

Terms abacus and actor are adjacent to each other as in the string

"abacus actor"

abacus near 4 actor

Terms abacus and actor are near to each other as in the string

"the actor has an abacus"

Some systems support other operators, such as with (two terms in the same sentence) or same (two terms in the same paragraph).

17

Evaluation of Boolean Operators

Precedence of operators must be defined:

adj, near high

and, not

or low

Example

A and B or C and B

is evaluated as

(A and B) or (C and B)

18

Inverted File

Inverted file:

A list of search terms that are used to index a set of documents. The inverted file is organized for associative look-up, i.e., to answer the question, "In which documents does a specified search term appear?"

In practical applications, the inverted file contains related information, such as the location within the document where the search terms appear.

19

Inverted File -- Basic Concept

Word Document

abacus 3

19 22

actor 2 19 29

aspen 5 atoll 11

34

Stop words are removed and stemming carried out before building the index.

20

Inverted List -- Concept

Inverted List: All the entries in an inverted file that apply to a specific word, e.g.

abacus 3 19 22

Posting: Entry in an inverted list, e.g., there are three postings for "abacus".

21

Evaluating a Boolean Query

3 19 22 2 19 29

To evaluate the and operator, merge the two inverted lists

with a logical AND operation.

Examples: abacus and actor

Postings for abacus

Postings for actor

Document 19 is the only document that contains both terms, "abacus" and "actor".

22

Enhancements to Inverted Files -- Concept

Location: The inverted file holds information about the location of each term within the document.

Uses

adjacency and near operatorsuser interface design -- highlight location of search term

Frequency: The inverted file includes the number of postings for each term.

Uses

term weightingquery processing optimization

23

Inverted File -- Concept (Enhanced)

Word Postings Document Location

abacus 4 3 94 19 7 19 212

22 56actor 3 2 66

19 213 29 45

aspen 1 5 43atoll 3 11 3

11 70 34 40

24

Evaluating an Adjacency Operation

Examples: abacus adj actor

Postings for abacus

Postings for actor

Document 19, locations 212 and 213, is the only occurrence of the terms "abacus" and "actor" adjacent.

3 94 19 719 212 22 56

2 66 19 213 29 45

cs 430: information discovery

Documents