
Page 1:

CS 430: Information Discovery

Lecture 2

Introduction to Text Based Information Retrieval

Page 2:

Course Administration

• Please send all questions about the course to:

[email protected]

The message will be sent to

[email protected] (Bill Arms)
[email protected] (Manpreet Singh)
[email protected] (Sid Anand)
[email protected] (Martin Guerrero)

Page 3:

Course Administration

Programming in Perl

Assignments 2, 3 and 4 require programs to be written in Perl. 

An introduction to programming in Perl will be given at 7:30 p.m. on Wednesdays, September 19 and October 3.

These classes are optional.  There will not be regular discussion classes on these dates.

Materials about Perl and further information about these classes will be posted on the course web site. 

Page 4:

Course Administration

Discussion class, Wednesday, September 4

Read and be prepared to discuss:

Harman, D., Fox, E., Baeza-Yates, R.A., Inverted files. (Frakes and Baeza-Yates, Chapter 3)

Phillips Hall 101, 7:30 to 8:30 p.m.

Page 5:

Classical Information Retrieval

[Diagram: map of classical information retrieval, organized by media type (text vs. image, video, audio, etc.) and by method (searching, browsing, linking); related approaches include statistical methods, user-in-the-loop catalogs and indexes (metadata, covered in CS 502), and natural language processing (covered in CS 474).]

Page 6:

Recall and Precision

If information retrieval were perfect ...

Every hit would be relevant to the original query, and every relevant item in the body of information would be found.

Precision: the percentage of the hits that are relevant; the extent to which the set of hits retrieved by a query satisfies the requirement that generated the query.

Recall: the percentage of the relevant items that are found by the query; the extent to which the query found all the items that satisfy the requirement.

Page 7:

Recall and Precision: Example

• Collection of 10,000 documents, 50 on a specific topic

• An ideal search finds these 50 documents and rejects the others

• The actual search identifies 25 documents; 20 are relevant but 5 are on other topics

• Precision: 20/25 = 0.8

• Recall: 20/50 = 0.4
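These two numbers can be reproduced with a few lines of Python (a sketch; the counts are the ones from the example above):

```python
# Precision and recall for the example: 10,000 documents in the
# collection, 50 relevant, and a search that returns 25 hits of
# which 20 are relevant.
relevant_found = 20
hits = 25
relevant_total = 50

precision = relevant_found / hits          # fraction of hits that are relevant
recall = relevant_found / relevant_total   # fraction of relevant items found

print(precision)  # 0.8
print(recall)     # 0.4
```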

Page 8:

Measuring Precision and Recall

Precision is easy to measure:

• A knowledgeable person looks at each document that is identified and decides whether it is relevant.

• In the example, only the 25 documents that are found need to be examined.

Recall is difficult to measure:

• To know all relevant items, a knowledgeable person must go through the entire collection, looking at every object to decide if it fits the criteria.

• In the example, all 10,000 documents must be examined.

Page 9:

Relevance and Ranking

Precision and recall assume that a document is either relevant to a query or not relevant.

Often a user will consider a document to be partially relevant.

Ranking methods: measure the degree of similarity between a query and a document.

[Diagram: requests and documents connected by a similarity measure.]

Similar: how similar is a document to a request?

Page 10:

Documents

A textual document is a digital object consisting of a sequence of words and other symbols, e.g., punctuation.

The individual words and other symbols are known as tokens or terms.

A textual document can be:

• Free text, also known as unstructured text, which is a continuous sequence of tokens.

• Fielded text, also known as structured text, in which the text is broken into sections that are distinguished by tags or other markup.

[Methods of markup, e.g., XML, are covered in CS 502.]

Page 11:

Word Frequency

Observation: Some words are more common than others.

Statistics: Most large collections of text documents have similar statistical characteristics. These statistics:

• influence the effectiveness and efficiency of data structures used to index documents

• underlie many retrieval models

The following example is taken from:

Jamie Callan, Characteristics of Text, 1997 http://hobart.cs.umass.edu/~allan/cs646-f97/char_of_text.html

Page 12:

Rank Frequency Distribution

For all the words in a collection of documents, for each word w:

f(w) is the frequency with which w appears

r(w) is the rank of w in order of frequency, e.g., the most commonly occurring word has rank 1

[Figure: rank-frequency plot of frequency f against rank r; a point (r, f) marks a word w with rank r and frequency f.]
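The distribution is easy to compute for any collection. A minimal Python sketch, using a made-up sample text:

```python
from collections import Counter

text = ("the cat sat on the mat and the dog sat on the rug "
        "and the cat and the dog sat")
counts = Counter(text.split())

# Sort words by descending frequency; rank 1 is the most common word.
ranked = sorted(counts.items(), key=lambda wf: wf[1], reverse=True)
for rank, (word, freq) in enumerate(ranked, start=1):
    print(rank, word, freq)
```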

Page 13:

word        f     word        f      word        f
the   1130021     from     96900     or      54958
of     547311     he       94585     about   53713
to     516635     million  93515     market  52110
a      464736     year     90104     they    51359
in     390819     its      86774     this    50933
and    387703     be       85588     would   50828
that   204351     was      83398     you     49281
for    199340     company  83070     which   48273
is     152483     an       76974     bank    47940
said   148302     has      74405     stock   47401
it     134323     are      74097     trade   47310
on     121173     have     73132     his     47116
by     118863     but      71887     more    46244
as     109135     will     71494     who     42142
at     101779     say      66807     one     41635
mr     101679     new      64456     their   40910
with   101210     share    63925

Page 14:

Zipf's Law

If the words, w, in a collection are ranked, r(w), by their frequency, f(w), they roughly fit the relation:

r(w) * f(w) = c

Different collections have different constants c.

In English text, c tends to be about n / 10, where n is the number of distinct words in the collection.
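A quick numerical check, using ranks and frequencies read off the table two pages back (the products r(w) * f(w) are only roughly constant, and are noticeably lower for the very top ranks, as the next page's table also shows):

```python
# Ranks and frequencies taken from the word-frequency table shown earlier.
data = [
    (1, "the", 1130021),
    (2, "of", 547311),
    (10, "said", 148302),
    (20, "million", 93515),
    (30, "but", 71887),
    (40, "would", 50828),
    (50, "their", 40910),
]
# r(w) * f(w) climbs toward roughly 2 million and then levels off.
products = {word: rank * freq for rank, word, freq in data}
for word, c in products.items():
    print(f"{word:8s} r*f = {c}")
```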

For a weird but wonderful discussion of this and many other examples of naturally occurring rank frequency distributions, see:

Zipf, G. K., Human Behaviour and the Principle of Least Effort. Addison-Wesley, 1949

Page 15:

word  1000*rf/n    word     1000*rf/n    word    1000*rf/n
the       59       from         92       or          101
of        58       he           95       about       102
to        82       million      98       market      101
a         98       year        100       they        103
in       103       its         100       this        105
and      122       be          104       would       107
that      75       was         105       you         106
for       84       company     109       which       107
is        72       an          105       bank        109
said      78       has         106       stock       110
it        78       are         109       trade       112
on        77       have        112       his         114
by        81       but         114       more        114
as        80       will        117       who         106
at        80       say         113       one         107
mr        86       new         112       their       108
with      91       share       114

Page 16:

Luhn's Proposal

"It is here proposed that the frequency of word occurrence in an article furnishes a useful measurement of word significance. It is further proposed that the relative position within a sentence of words having given values of significance furnish a useful measurement for determining the significance of sentences. The significance factor of a sentence will therefore be based on a combination of these two measurements."

Luhn, H.P., The automatic creation of literature abstracts, IBM Journal of Research and Development, 2, 159-165 (1958)

Page 17:

Methods that Build on Zipf's Law

Term weighting: Give differing weights to terms based on their frequency, with the most frequent words weighted less.

Stop lists: Ignore the most frequent words (upper cut-off)

Significant words: Ignore the most frequent and least frequent words (upper and lower cut-off)
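A sketch of the upper and lower cut-off idea, with the cut-offs expressed here as frequency thresholds rather than ranks (the sample text and thresholds are made up):

```python
from collections import Counter

def significant_words(tokens, upper_cutoff, lower_cutoff):
    """Keep words whose frequency falls between the cut-offs:
    drop the most frequent (stop-list region) and the rarest."""
    counts = Counter(tokens)
    return sorted(w for w, f in counts.items()
                  if lower_cutoff <= f <= upper_cutoff)

tokens = ("the system ranks the documents the query matches "
          "and the system indexes documents").split()
# the: 4; system, documents: 2; all other words: 1
print(significant_words(tokens, upper_cutoff=3, lower_cutoff=2))
# ['documents', 'system']
```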

Page 18:

Cut-off Levels for Significance Words

[Figure: frequency f plotted against rank r, with an upper cut-off and a lower cut-off marked; the curve of the resolving power of significant words peaks over the band of significant words between the two cut-offs.]

from: Van Rijsbergen, Ch. 2

Page 19:

Approaches to Weighting

Boolean information retrieval:

Weight of term i in document j:

w(i, j) = 1 if term i occurs in document j
w(i, j) = 0 otherwise

Vector space methods

Weight of term i in document j:

0 < w(i, j) <= 1 if term i occurs in document j
w(i, j) = 0 otherwise
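The two weighting schemes side by side, as a sketch (the vector-space weight here is raw term frequency normalized by the document's most frequent term; real systems use more refined weights, e.g. tf.idf):

```python
from collections import Counter

def boolean_weight(term, doc_tokens):
    # w(i, j) = 1 if term i occurs in document j, else 0
    return 1 if term in doc_tokens else 0

def tf_weight(term, doc_tokens):
    # A simple vector-space weight: term frequency divided by the
    # count of the most frequent term, so 0 < w <= 1 when present.
    counts = Counter(doc_tokens)
    if term not in counts:
        return 0.0
    return counts[term] / max(counts.values())

doc = "the bank raised the rate and the market fell".split()
print(boolean_weight("bank", doc))   # 1
print(tf_weight("bank", doc))        # 1 occurrence / 3 for "the" = 0.333...
print(tf_weight("stock", doc))       # 0.0
```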

Page 20:

Functional View of Information Retrieval

[Diagram: requests and documents are both converted to representations, which are matched against each other through the index database.]

Similar: mechanism for determining the similarity of the request representation to the information item representation.

Page 21:

Major Subsystems

Indexing subsystem: Receives incoming documents, converts them to the form required for the index and adds them to the index database.

Search subsystem: Receives incoming requests, converts them to the form required for searching the index and searches the database for matching documents.

The index database is the central hub of the system.

Page 22:

Example: Indexing Subsystem

Documents
  → assign document IDs (documents → document numbers and *field numbers)
  → break into words (text → words)
  → stoplist (words → non-stoplist words)
  → stemming* (non-stoplist words → stemmed words)
  → term weighting* (stemmed words → terms with weights)
  → Index database

*Indicates optional operation.

from Frakes, page 7
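The pipeline can be sketched in a few lines of Python (the stop list, stemmer, and weighting below are toy stand-ins, not the components Frakes describes):

```python
from collections import defaultdict

STOPLIST = {"the", "a", "of", "to", "and", "in", "is"}

def stem(word):
    # Toy stemmer: strip a trailing "s"; a real system would use
    # something like the Porter stemmer.
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def index_documents(documents):
    """documents: list of strings. Returns an inverted index:
    term -> {doc_id: term count (a crude weight)}."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, text in enumerate(documents):      # assign document IDs
        for word in text.lower().split():          # break into words
            if word in STOPLIST:                   # stoplist
                continue
            term = stem(word)                      # stemming*
            index[term][doc_id] += 1               # term weighting*
    return index

docs = ["The banks trade stocks", "Stocks fell in the market"]
index = index_documents(docs)
print(dict(index["stock"]))  # {0: 1, 1: 1}
```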

Page 23:

Example: Search Subsystem

Query
  → parse query
  → stemming* (→ stemmed words)
  → stoplist (→ non-stoplist words)
  → query terms
  → Boolean operations against the index database (→ retrieved document set)
  → ranking* (→ ranked document set)
  → relevance judgments* (→ relevant document set)

*Indicates optional operation.
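A matching sketch of the search side, reusing the same toy stoplist and stemmer so queries are normalized the way documents were (the Boolean operation shown is AND; ranking and relevance feedback are left out):

```python
STOPLIST = {"the", "a", "of", "to", "and", "in", "is"}

def stem(word):
    # Same toy stemmer as on the indexing side.
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def search(query, index):
    """Parse a query through the stoplist and stemmer, then intersect
    posting lists: a Boolean AND over the query terms."""
    terms = [stem(w) for w in query.lower().split() if w not in STOPLIST]
    if not terms:
        return set()
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings)

# Tiny hand-built inverted index: term -> {doc_id: weight}
index = {"stock": {0: 1, 1: 1}, "market": {1: 1}, "bank": {0: 1}}
print(search("stocks in the market", index))  # {1}
```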