Chapter 2: Information Retrieval

Chapter 2 in the textbook
Sections: 2.1, 2.2 (2.2.1, 2.2.2), 2.3 (2.3.1, 2.3.2, 2.3.3), 2.4 (2.4.1, 2.4.2)

Modern Information Retrieval

• Document representation
  • Using keywords
  • Relative weight of keywords
• Query representation
  • Keywords
  • Relative importance of keywords
• Retrieval model
  • Similarity between document and query
  • Rank the documents
• Performance evaluation of the retrieval process


Document Representation

Transforming a text document to a weighted list of keywords


Stopwords

Figure 2.2 A partial list of stopwords

Sample Document

Data Mining has emerged as one of the most exciting and dynamic fields in computing science. The driving force for data mining is the presence of petabyte-scale online archives that potentially contain valuable bits of information hidden in them. Commercial enterprises have been quick to recognize the value of this concept; consequently, within the span of a few years, the software market itself for data mining is expected to be in excess of $10 billion. Data mining refers to a family of techniques used to detect interesting nuggets of relationships/knowledge in data. While the theoretical underpinnings of the field have been around for quite some time (in the form of pattern recognition, statistics, data analysis and machine learning), the practice and use of these techniques have been largely ad-hoc. With the availability of large databases to store, manage and assimilate data, the new thrust of data mining lies at the intersection of database systems, artificial intelligence and algorithms that efficiently analyze data. The distributed nature of several databases, their size and the high complexity of many techniques present interesting computational challenges.


List of words in d1 after deleting stopwords
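The stopword-deletion step can be sketched as follows. This is a minimal illustration rather than the textbook's procedure: the `STOPWORDS` set is a small hypothetical sample (the full list is the one in Figure 2.2), and the regular-expression tokenizer is an assumption.

```python
import re

# Hypothetical sample; the actual list is the one in Figure 2.2.
STOPWORDS = {"a", "as", "has", "in", "of", "one", "the", "and", "most"}

def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase the text, split it into words, and drop the stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in stopwords]

sentence = ("Data Mining has emerged as one of the most exciting "
            "and dynamic fields in computing science.")
keywords = remove_stopwords(sentence)
print(keywords)
```

Applied to the first sentence of the sample document, only the content-bearing words remain.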

Stemming

A given word may occur in a variety of syntactic forms:
• plurals
• past tense
• gerund forms (a noun derived from a verb)

Example: the word connect may appear as connector, connection, connections, connected, connecting, connects, preconnection, and postconnection.

Stemming

A stem is what is left of a word after its affixes (prefixes and suffixes) are removed.
• Suffixes: connector, connection, connections, connected, connecting, connects
• Prefixes: preconnection and postconnection
• Stem: connect

Porter’s Algorithm

• Letters A, E, I, O, and U are vowels
• A consonant in a word is a letter other than A, E, I, O, or U, with the exception of Y
• The letter Y is a vowel if it is preceded by a consonant; otherwise it is a consonant
• For example, Y in synopsis is a vowel, while in toy it is a consonant
• A consonant in the algorithm description is denoted by c, and a vowel by v
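The letter rules above can be sketched in a few lines. This is an illustrative reading of the rules, not a full stemmer; the function names `is_consonant` and `cv_pattern` are chosen here for the example.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant under the rules listed above."""
    ch = word[i].lower()
    if ch in "aeiou":
        return False
    if ch == "y":
        # Y is a vowel only when the preceding letter is a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def cv_pattern(word):
    """Map each letter to 'c' (consonant) or 'v' (vowel)."""
    return "".join("c" if is_consonant(word, i) else "v"
                   for i in range(len(word)))

print(cv_pattern("synopsis"))  # cvcvccvc
print(cv_pattern("toy"))       # cvc
```

The Y in synopsis follows the consonant S and is marked v, while the Y in toy follows the vowel O and is marked c.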

Porter’s Algorithm

m is the measure of vc repetition
• m = 0: TR, EE, TREE, Y, BY
• m = 1: TROUBLE, OATS, TREES, IVY
• m = 2: TROUBLES, PRIVATE, OATEN, ORRERY

Conditions on the stem:
• *S: the stem ends with S (similarly for other letters)
• *v*: the stem contains a vowel
• *d: the stem ends with a double consonant (e.g., -TT)
• *o: the stem ends cvc, where the second c is not W, X, or Y (e.g., -WIL)
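In Porter's description every word has the form [C](VC)^m[V], where C and V stand for runs of consonants and vowels, so m counts the VC pairs. The sketch below computes m and the three stem conditions. It is a minimal illustration; the helper names are chosen for this example and are not taken from any library.

```python
from itertools import groupby

def is_consonant(word, i):
    ch = word[i].lower()
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count the VC repetitions m in the form [C](VC)^m[V]."""
    pattern = "".join("c" if is_consonant(stem, i) else "v"
                      for i in range(len(stem)))
    collapsed = "".join(key for key, _ in groupby(pattern))  # "ccvvc" -> "cvc"
    return collapsed.count("vc")

def contains_vowel(stem):           # condition *v*
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def ends_double_consonant(stem):    # condition *d
    return (len(stem) >= 2 and stem[-1].lower() == stem[-2].lower()
            and is_consonant(stem, len(stem) - 1))

def ends_cvc(stem):                 # condition *o
    n = len(stem)
    return (n >= 3 and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1)
            and stem[-1].lower() not in "wxy")

for word in ["TREE", "BY", "TROUBLE", "IVY", "PRIVATE", "ORRERY"]:
    print(word, measure(word))
```

Collapsing runs of the same letter class first means that counting the substring "vc" gives m directly.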

Porter’s algorithm - Step 1

Step 1: plurals and past participles
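As an illustration of the plural handling in Step 1, the sketch below applies the four suffix rules that Porter's published description lists for Step 1a (SSES to SS, IES to I, SS to SS, S to nothing). It covers only this sub-step; past participles are not handled, and the function name `step_1a` is chosen for this example.

```python
# Step 1a rules as (suffix, replacement); the first matching suffix wins,
# so longer suffixes are listed before shorter ones.
STEP_1A_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def step_1a(word):
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "caress", "cats", "oscillators"]:
    print(w, "->", step_1a(w))
```

The SS rule exists so that words such as caress are left alone instead of losing their final S.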


Porter’s algorithm - Step 2

Steps 2–4: straightforward stripping of suffixes

Step 5: tidying-up

Example

GENERALIZATIONS
• Step 1: GENERALIZATION
• Step 2: GENERALIZE
• Step 3: GENERAL
• Step 4: GENER

OSCILLATORS
• Step 1: OSCILLATOR
• Step 2: OSCILLATE
• Step 4: OSCILL
• Step 5: OSCIL


Suffix stripping of a vocabulary of 10,000 words (http://www.tartarus.org/~martin/)

Porter’s algorithm


Document Representation

Term-Document Matrix

• Term-document matrix (TDM) is a two-dimensional representation of a document collection.
• Rows of the matrix represent various documents
• Columns correspond to various index terms
• Values in the matrix can be either the frequency or weight of the index term (identified by the column) in the document (identified by the row).
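A frequency-based TDM with this layout can be built as sketched below. The three documents are hypothetical and assumed to be already reduced to stemmed keywords; `build_tdm` is a name chosen for this example.

```python
from collections import Counter

def build_tdm(docs):
    """Return (terms, matrix); matrix[i][j] is the frequency of terms[j] in docs[i]."""
    counts = [Counter(doc) for doc in docs]
    terms = sorted(set().union(*docs))       # one column per distinct index term
    matrix = [[c[t] for t in terms] for c in counts]
    return terms, matrix

# Hypothetical documents, already reduced to stemmed keywords.
docs = [
    ["data", "mine", "data", "databas"],
    ["databas", "system"],
    ["mine", "algorithm", "mine"],
]
terms, tdm = build_tdm(docs)
print(terms)
for row in tdm:
    print(row)
```

Each row is one document and each column one index term, matching the description above.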


Term-Document matrix

Sparse Matrices - Triples

Sparse Matrices - Pairs
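The figures for these two slides are not reproduced here, so the sketch below rests on an assumed reading. Triples are taken to be (document, term, value) entries for the nonzero cells only. Pairs are taken to be one list per document of (term, value) entries. The matrix is hypothetical.

```python
def to_triples(matrix):
    """List (row, column, value) for every nonzero cell."""
    return [(i, j, v)
            for i, row in enumerate(matrix)
            for j, v in enumerate(row) if v != 0]

def to_pairs(matrix):
    """For each row (document), list (column, value) for its nonzero cells."""
    return [[(j, v) for j, v in enumerate(row) if v != 0] for row in matrix]

# Hypothetical term-document matrix: rows are documents, columns are terms.
tdm = [
    [0, 2, 1, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 2, 0],
]
print(to_triples(tdm))
print(to_pairs(tdm))
```

Both forms store only the nonzero entries, which is what makes them attractive when most cells of the matrix are zero.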

Normalization

• Raw frequency values are not useful for a retrieval model
• Prefer normalized weights, usually between 0 and 1, for each term in a document
• Dividing all the keyword frequencies by the largest frequency in the document is a simple method of normalization
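The division by the largest frequency can be sketched as follows. The matrix is hypothetical, and leaving an all-zero row unchanged is an assumption made here to avoid dividing by zero.

```python
def normalize_rows(matrix):
    """Divide each row by its largest value; all-zero rows are left unchanged."""
    normalized = []
    for row in matrix:
        largest = max(row, default=0)
        normalized.append([v / largest for v in row] if largest else list(row))
    return normalized

# Hypothetical frequencies: rows are documents, columns are terms.
tdm = [
    [0, 2, 1, 1, 0],
    [1, 0, 0, 4, 0],
]
print(normalize_rows(tdm))
```

After normalization the most frequent term in each document has weight 1 and every other weight lies between 0 and 1.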


Normalized Term-Document Matrix


Vector Representation of document d1

(word, frequency, normalized frequency)

Retrieval models

Retrieval models match a query with documents to:
• separate documents into relevant and non-relevant classes
• rank the documents according to their relevance

Retrieval models

• Boolean model
• Vector space model (VSM)
• Probabilistic models

Boolean Retrieval Model

• One of the simplest and most efficient retrieval mechanisms
• Based on set theory and Boolean algebra
• Conventional numeric representations of false as 0 and true as 1
• Boolean model is interested only in the presence or absence of a term in a document
• In the term-document matrix, replace all the nonzero values with 1
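Converting a frequency or weight matrix to its Boolean form is a one-line transformation. The sketch below uses a small hypothetical matrix.

```python
def to_boolean(matrix):
    """Replace every nonzero entry with 1 and keep zeros as 0."""
    return [[1 if v != 0 else 0 for v in row] for row in matrix]

# Hypothetical matrix mixing raw counts and a fractional weight.
tdm = [[0, 2, 1], [3, 0, 0.5]]
print(to_boolean(tdm))
```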


Boolean Term-document Matrix

Example

Document set
• DocSet(K0) = {D1, D3, D5}
• DocSet(K4) = {D2, D3, D4, D6}

Query
• K0 and K4
• K0 or K4
• K0 or (not K3 and K5)
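The first two queries map directly onto set intersection and union. The sketch below uses the two document sets given above. The third query is omitted because the document sets for K3 and K5 are not listed here, and negation would additionally need the full collection.

```python
doc_set = {
    "K0": {"D1", "D3", "D5"},
    "K4": {"D2", "D3", "D4", "D6"},
}

k0_and_k4 = doc_set["K0"] & doc_set["K4"]   # and -> set intersection
k0_or_k4 = doc_set["K0"] | doc_set["K4"]    # or  -> set union

print(sorted(k0_and_k4))
print(sorted(k0_or_k4))
```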

Boolean Query

• User Boolean queries are usually simple Boolean expressions
• A Boolean query can be represented in a “disjunctive normal form” (DNF)
  • disjunction corresponds to or
  • conjunction refers to and
  • DNF consists of a disjunction of conjunctive Boolean expressions

DNF form

• K0 or (not K3 and K5) is in DNF
• DNF query processing can be very efficient
• If any one of the conjunctive expressions is true, the entire DNF will be true
• Short-circuit the expression evaluation
• Stop matching the expression with a document as soon as a conjunctive expression matches the document; label the document as relevant to the query
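Short-circuit evaluation of a DNF query can be sketched with Python's `any` and `all`, which stop at the first deciding value. The query encoding (a list of conjunctions, each a list of term and required-presence pairs) and the four documents are assumptions made for this illustration.

```python
# K0 or (not K3 and K5): a list of conjunctions,
# each a list of (term, required_presence) literals.
QUERY = [
    [("K0", True)],
    [("K3", False), ("K5", True)],
]

def matches(dnf, doc_terms):
    """True as soon as one conjunction holds for the document."""
    return any(
        all((term in doc_terms) == wanted for term, wanted in conjunction)
        for conjunction in dnf
    )

# Hypothetical documents, each given as the set of terms it contains.
docs = {
    "Da": {"K0", "K3"},
    "Db": {"K3", "K5"},
    "Dc": {"K5"},
    "Dd": {"K1"},
}
relevant = [name for name, terms in docs.items() if matches(QUERY, terms)]
print(relevant)
```

Because both calls take generator expressions, a document that satisfies the first conjunction is never tested against the second.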

Boolean Model

Advantages
• Simplicity and efficiency of implementation
• Binary values can be stored using bits
  • reduced storage requirements
  • retrieval using bitwise operations is efficient
• Boolean retrieval was adopted by many commercial bibliographic systems
• Boolean queries are akin to database queries
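The bit-level storage idea can be sketched by packing each term's document set into one integer, with bit i standing for document Di. The helper names are chosen for this example; the two document sets are the ones from the earlier K0 and K4 example.

```python
def to_bits(doc_ids):
    """Pack document numbers into an integer: bit i is 1 if Di contains the term."""
    bits = 0
    for i in doc_ids:
        bits |= 1 << i
    return bits

def to_doc_ids(bits):
    """Unpack the integer back into a sorted list of document numbers."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]

k0 = to_bits([1, 3, 5])        # DocSet(K0) = {D1, D3, D5}
k4 = to_bits([2, 3, 4, 6])     # DocSet(K4) = {D2, D3, D4, D6}

print(to_doc_ids(k0 & k4))     # K0 and K4
print(to_doc_ids(k0 | k4))     # K0 or K4
```

A single bitwise operation evaluates the query for every document at once.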

Boolean Model

Disadvantages
• A document is either relevant or non-relevant to the query
• It is not possible to assign a degree of relevance
• Complicated Boolean queries are difficult for users
• Boolean queries retrieve too few or too many documents
  • K0 and K4 retrieved only 1 out of 6 documents
  • K0 or K4 retrieved 5 out of a possible 6 documents

Vector Space Model (VSM)

• Treats both the documents and queries as vectors
• Each vector component is a weight based on the term's frequency in the document


Graphical representation of the VSM Model


Computing the similarity

Relevance Values and Ranking

Ranking
1. D0 (0.7774)
2. D6 (0.4953)
3. D2 (0.3123)
4. D1 (0.2590)
5. D5 (0.2122)
6. D4 (0.1727)
7. D3 (0.1084)
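The similarity formula itself is not reproduced in the slide text. The sketch below assumes the cosine of the angle between the query and document vectors, a common choice for the VSM, and ranks a few hypothetical vectors with it. Its scores are therefore illustrative and are not the values listed above.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return (name, score) pairs sorted by decreasing similarity to the query."""
    scores = {name: cosine_similarity(query, vec) for name, vec in docs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical weight vectors over three index terms.
docs = {"Da": [1.0, 0.0, 0.5], "Db": [0.0, 1.0, 0.0], "Dc": [0.5, 0.5, 0.0]}
query = [1.0, 0.0, 0.0]
ranking = rank(query, docs)
for name, score in ranking:
    print(f"{name} ({score:.4f})")
```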

Variations of VSM

• Variations of the normalized frequency
• Inverse document frequency (idf)
  • N = no. of documents
  • n_j = no. of documents containing the jth term
• Modified weights
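The idf and modified-weight formulas are not reproduced in the slide text. The sketch below assumes the common definition idf_j = log10(N / n_j) and a modified weight equal to the normalized frequency multiplied by the idf of the term. Both are assumptions, and the counts are illustrative.

```python
import math

def idf(n_docs, n_docs_with_term):
    """Inverse document frequency, assuming idf_j = log10(N / n_j)."""
    return math.log10(n_docs / n_docs_with_term)

def idf_weights(normalized_row, doc_counts, n_docs):
    """Scale each normalized frequency by the idf of its term (assumed weighting)."""
    return [w * idf(n_docs, n) for w, n in zip(normalized_row, doc_counts)]

# Illustrative collection: N = 7 documents, three terms found in 3, 4 and 7 documents.
print([round(idf(7, n), 3) for n in (3, 4, 7)])
print(idf_weights([1.0, 0.5, 0.25], [3, 4, 7], 7))
```

Under this definition a term that occurs in every document gets an idf of 0, so it no longer contributes to the similarity.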

Inverse Document Frequencies for Collection (normalized)

$idf_0 = idf_1 = idf_2 = idf_3 = \log\frac{7}{n_j} = 0.3683$

$idf_4 = idf_5 = idf_6 = \log\frac{7}{n_j} = 0.2434$


TDM using idf

$q = (0, 0.2, 0.6, 0, 0.2, 0.3, 0)$

Ranking
1. D0 (0.7867)
2. D6 (0.4953)
3. D2 (0.3361)
4. D1 (0.2590)
5. D5 (0.2215)
6. D4 (0.1208)
7. D3 (0.0969)

VSM vs. Boolean

• Queries are easier to express: allow users to attach relative weights to terms
• A descriptive query can be transformed to a query vector similar to documents
• Matching between a query and a document is not precise: document is allocated a degree of similarity
• Documents are ranked based on their similarity scores instead of relevant/non-relevant classes
• Users can go through the ranked list until their information needs are met.

Evaluation of Retrieval Performance

Evaluation should include:
• Functionality
• Response time
• Storage requirement
• Accuracy

Accuracy Testing

• Early days:
  • Batch testing
  • Document collection such as cacm.all
  • Query collection such as query.text
• Present day: interactive tests are used
  • Difficult to conduct and time consuming
• Batch testing still important

Precision and Recall

• Precision: How many from the retrieved are relevant?
• Recall: How many from the relevant are retrieved?

Our earlier example illustrating the VSM
• Documents from Fig. 2.15
• Query $q = (0, 0.2, 0.6, 0, 0.2, 0.3, 0)$

Ranking
1. D0*
2. D6
3. D2*
4. D1
5. D5*
6. D4
7. D3*

• Semantic analysis: documents with an asterisk are relevant
• Retrieved the three top-ranked documents
• Relevant documents: $R = \{D0, D2, D5, D3\}$
• Retrieved documents: $A = \{D0, D6, D2\}$
• $R \cap A = \{D0, D2\}$

$precision = \frac{|R \cap A|}{|A|} = \frac{|\{D0, D2\}|}{|\{D0, D6, D2\}|} = \frac{2}{3} = 0.67$

$recall = \frac{|R \cap A|}{|R|} = \frac{|\{D0, D2\}|}{|\{D0, D2, D5, D3\}|} = \frac{2}{4} = 0.5$
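The same calculation can be sketched with Python sets, using the relevant set R and the retrieved set A from this example. Returning 0.0 for an empty set is an assumption made here to avoid dividing by zero.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for a relevant set and a retrieved set."""
    hits = relevant & retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

R = {"D0", "D2", "D5", "D3"}   # relevant documents
A = {"D0", "D6", "D2"}         # retrieved documents
precision, recall = precision_recall(R, A)
print(round(precision, 2), round(recall, 2))
```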

F-measure

$F = \frac{2 \times precision \times recall}{precision + recall} = \frac{2 \times 0.67 \times 0.5}{0.67 + 0.5} = \frac{0.67}{1.17} = 0.57$
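The F-measure computation is a single expression. The sketch below reproduces the figure for precision 0.67 and recall 0.5; returning 0.0 when both inputs are zero is an assumption made here to avoid dividing by zero.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(0.67, 0.5), 2))
```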

Average Precision

The choice of three retrieved documents was arbitrary.

Rank retrieved   Precision   Recall
1                1.00        0.25
2                0.50        0.25
3                0.67        0.50
4                0.50        0.50
5                0.60        0.75
6                0.50        0.75
7                0.57        1.00
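The precision and recall at every cutoff can be generated from the ranking in one pass, as sketched below. The ranking and the relevant set are those of the earlier VSM example; the function name is chosen for this illustration.

```python
def precision_recall_by_rank(ranking, relevant):
    """Return (rank, precision, recall) after each retrieved document."""
    rows, hits = [], 0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        rows.append((k, hits / k, hits / len(relevant)))
    return rows

ranking = ["D0", "D6", "D2", "D1", "D5", "D4", "D3"]
relevant = {"D0", "D2", "D5", "D3"}
rows = precision_recall_by_rank(ranking, relevant)
for k, p, r in rows:
    print(f"{k}  {p:.2f}  {r:.2f}")
```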


Relationship between precision and recall

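Average Precision

$\text{Average Precision} = \frac{\sum_{i=1}^{N} precision(i) \times relevance(i)}{|R|}$

$\text{Average Precision} = \frac{1.00 \times 1 + 0.50 \times 0 + 0.67 \times 1 + 0.50 \times 0 + 0.60 \times 1 + 0.50 \times 0 + 0.57 \times 1}{4} = \frac{2.84}{4} = 0.71$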
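The average precision sum can be sketched as a single loop over the relevance flags of the ranked list. The flags below encode the ranking D0*, D6, D2*, D1, D5*, D4, D3* from the earlier example; the function name is chosen for this illustration.

```python
def average_precision(relevance, n_relevant):
    """relevance[i] is 1 if the document at rank i+1 is relevant, else 0."""
    hits, total = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        hits += rel
        total += (hits / k) * rel   # precision(k) * relevance(k)
    return total / n_relevant

# Ranking D0*, D6, D2*, D1, D5*, D4, D3* (asterisk = relevant), |R| = 4.
print(round(average_precision([1, 0, 1, 0, 1, 0, 1], 4), 2))
```

Only the ranks that hold a relevant document contribute to the sum, so pushing relevant documents lower in the list reduces the score.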