Chapter 2: Information Retrieval
Ms. Malak Bagais
[textbook]: Chapter 2

Upload: ella-kirkland

Post on 01-Jan-2016


DESCRIPTION

Chapter 2 Information Retrieval. Ms. Malak Bagais [textbook]: Chapter 2. Objectives: by the end of this lecture, the student will be able to list information retrieval components, describe document representation, apply Porter's Algorithm, and compare and apply different retrieval models.

TRANSCRIPT


Chapter 2: Information Retrieval
Ms. Malak Bagais
[textbook]: Chapter 2

Objectives
By the end of this lecture, the student will be able to:
- List information retrieval components
- Describe document representation
- Apply Porter's Algorithm
- Compare and apply different retrieval models
- Evaluate the performance of retrieval

Information Retrieval
- summarization
- searching
- indexing

Information Retrieval Components

Document Representation

Transforming a text document to a weighted list of keywords.

Stopwords
a, about, an, are, as, at, be, by, com, de, en, for, from, how, I, in, is, it, la, of, on, or, that, the, their, there, this, to, und, was, were, what, when, where, who, why, will, with, within, without, wont, www

Sample Document
Data Mining has emerged as one of the most exciting and dynamic fields in computing science. The driving force for data mining is the presence of petabyte-scale online archives that potentially contain valuable bits of information hidden in them. Commercial enterprises have been quick to recognize the value of this concept; consequently, within the span of a few years, the software market itself for data mining is expected to be in excess of $10 billion. Data mining refers to a family of techniques used to detect interesting nuggets of relationships/knowledge in data. While the theoretical underpinnings of the field have been around for quite some time (in the form of pattern recognition, statistics, data analysis and machine learning), the practice and use of these techniques have been largely ad hoc. With the availability of large databases to store, manage and assimilate data, the new thrust of data mining lies at the intersection of database systems, artificial intelligence and algorithms that efficiently analyze data. The distributed nature of several databases, their size and the high complexity of many techniques present interesting computational challenges.

Delete stopwords
After deleting stopwords, the sample document reduces to the following word list (repeated words shown with counts):
ad, algorithms, analysis, analyze, archives, artificial, assimilate, availability, billion, bits, challenges, commercial, complexity, computational, computing, concept, data (×8), database, databases (×2), detect, distributed, driving, dynamic, efficiently, emerged, enterprises, excess, exciting, expected, family, field, fields, force, form, hidden, high, hoc, information, intelligence, interesting (×2), intersection, knowledge, large, largely, learning, lies, machine, manage, market, mining (×5), nature, nuggets, online, pattern, petabyte-scale, potentially, practice, presence, present, quick, recognition, recognize, refers, relationships, science, size, software, span, statistics, store, systems, techniques (×3), theoretical, thrust, time, underpinnings, valuable, years

Stemming
A given word may occur in a variety of syntactic forms:

- plurals
- past tense
- gerund forms
Example: connect

Stemming
A stem is what is left after its affixes (prefixes and suffixes) are removed.

Porter's Algorithm
The letters A, E, I, O, and U are vowels. A consonant in a word is a letter other than A, E, I, O, or U, with the exception of Y: the letter Y is a vowel if it is preceded by a consonant, otherwise it is a consonant. For example, Y in "synopsis" is a vowel, while in "toy" it is a consonant. In the algorithm description, a consonant is denoted by c and a vowel by v.

m is the measure of (vc) repetition.

Conditions on the stem:
- *S: the stem ends with S (similarly for other letters)
- *v*: the stem contains a vowel
- *d: the stem ends with a double consonant (e.g., -TT)
- *o: the stem ends cvc, where the second c is not W, X, or Y (e.g., -WIL)

Example: OATS has the pattern vvcc, so m = 1.
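The measure m can be computed mechanically from the c/v pattern of a word. A minimal sketch in Python (the helper names are illustrative, not from the slides):

```python
def is_consonant(word, i):
    """A letter is a consonant unless it is a/e/i/o/u, or it is
    'y' preceded by a consonant (in which case 'y' acts as a vowel)."""
    ch = word[i].lower()
    if ch in "aeiou":
        return False
    if ch == "y":
        # 'y' after a consonant is a vowel; otherwise it is a consonant
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(word):
    """Porter's m: the number of (vc) repetitions in [C](VC)^m[V]."""
    # Collapse the word into a c/v pattern, merging runs of the same type
    pattern = ""
    for i in range(len(word)):
        t = "c" if is_consonant(word, i) else "v"
        if not pattern or pattern[-1] != t:
            pattern += t
    return pattern.count("vc")

for w in ["TR", "EE", "TREE", "Y", "BY", "TROUBLE", "OATS", "TREES", "IVY",
          "TROUBLES", "PRIVATE", "OATEN", "ORRERY"]:
    print(w, measure(w))
```

Running this reproduces the m values asked for on the next slide (e.g. TREE gives 0, OATS gives 1, PRIVATE gives 2).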

Porter's Algorithm
What is the value of m in the following words?
TR, EE, TREE, Y, BY, TROUBLE, OATS, TREES, IVY, TROUBLES, PRIVATE, OATEN, ORRERY

Porter's Algorithm: Step 1
Step 1: plurals and past participles

Steps 2-4: straightforward stripping of suffixes.

Porter's Algorithm: Step 2

Steps 2-4: straightforward stripping of suffixes.

Porter's Algorithm: Step 3

Steps 2-4: straightforward stripping of suffixes.

Porter's Algorithm: Step 4

Example
generalizations
Step 1: GENERALIZATION
Step 2: GENERALIZE
Step 3: GENERAL
Step 4: GENER

OSCILLATORS
Step 1: OSCILLATOR
Step 2: OSCILLATE
Step 4: OSCILL
Step 5: OSCIL

Number of words reduced in step 1: 3597; step 2: 766; step 3: 327; step 4: 2424; step 5: 1373. Number of words not reduced: 3650.

These counts come from an experiment reported on Porter's site: suffix stripping of a vocabulary of 10,000 words.
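The plural-stripping part of Step 1 (known as Step 1a in Porter's description) can be sketched as a small ordered rule table; this shows only the first sub-step, as an illustration:

```python
def porter_step_1a(word):
    """Porter's Step 1a: SSES -> SS, IES -> I, SS -> SS, S -> ''.
    Rules are tried in order; the first matching suffix wins."""
    for suffix, replacement in [("sses", "ss"), ("ies", "i"),
                                ("ss", "ss"), ("s", "")]:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

print(porter_step_1a("caresses"))  # caress
print(porter_step_1a("ponies"))    # poni
print(porter_step_1a("caress"))    # caress
print(porter_step_1a("cats"))      # cat
```

The seemingly redundant SS -> SS rule matters: it stops "caress" from falling through to the bare S rule and losing its final letter.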

http://www.tartarus.org/~martin/

Porter's Algorithm

Term-Document Matrix
A term-document matrix (TDM) is a two-dimensional representation of a document collection. Rows of the matrix represent various documents; columns correspond to various index terms. Values in the matrix can be either the frequency or the weight of the index term (identified by the column) in the document (identified by the row).

Term-Document Matrix

Term-Document matrix

Sparse Matrices: triples

Sparse Matrices: pairs

Normalization
Raw frequency values are not useful for a retrieval model. Normalized weights, usually between 0 and 1, are preferred for each term in a document. Dividing all the keyword frequencies by the largest frequency in the document is a simple method of normalization.
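The max-frequency normalization described above is easy to sketch. A minimal example (the tokenizer and the abbreviated stopword list are simplifying assumptions, not the slides' exact setup):

```python
from collections import Counter

# Abbreviated stopword list for illustration
STOPWORDS = {"the", "of", "in", "a", "is", "to", "and", "for", "as"}

def normalized_weights(text):
    """Count term frequencies, drop stopwords, then divide every
    frequency by the largest frequency in the document."""
    tokens = [t for t in text.lower().split()
              if t.isalpha() and t not in STOPWORDS]
    freqs = Counter(tokens)
    max_freq = max(freqs.values())
    return {term: count / max_freq for term, count in freqs.items()}

weights = normalized_weights("data mining uses data and more data for mining")
print(weights)
```

Here "data" occurs 3 times (the maximum), so it gets weight 1.0; "mining" occurs twice and gets 2/3, matching the scheme used for the sample document.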

Normalized Term-Document Matrix

Vector Representation
Vector representation of the sample document, showing the stemmed terms, their frequencies, and normalized frequencies (each frequency divided by the largest, 8 for "data"):

- frequency 8 (1.00): data
- frequency 5 (0.62): mine
- frequency 3 (0.375): databas, techniqu
- frequency 2 (0.25): comput, field, interest, larg
- frequency 1 (0.125): ad, algorithm, analysi, analyz, archiv, artifici, assimil, avail, billion, bit, challeng, commerci, complex, concept, detect, distribut, drive, dynam, effici, emerg, enterpris, excess, excit, expect, famili, forc, form, hidden, high, hoc, inform, intellig, intersect, knowledg, learn, li, machin, manag, market, natur, nugget, onlin, pattern, petabyte, potenti, practic, presenc, present, quick, recogn, recognit, refer, relationship, scienc, size, softwar, span, statist, store, system, theoret, thrust, time, underpin, valuabl, year

Retrieval Models
Retrieval models match a query with documents to:
- separate documents into relevant and non-relevant classes
- rank the documents according to their relevance

Boolean Retrieval Model
One of the simplest and most efficient retrieval mechanisms, based on set theory and Boolean algebra, with the conventional numeric representations of false as 0 and true as 1. The Boolean model is interested only in the presence or absence of a term in a document: in the term-document matrix, replace all the nonzero values with 1.

Boolean Term-Document Matrix

Examples

Document sets:
DocSet(K0) = {D1, D3, D5}
DocSet(K4) = {D2, D3, D4, D6}

Query K0 AND K4:
DocSet(K0) ∩ DocSet(K4) = {D3}

Query K0 OR K4:
DocSet(K0) ∪ DocSet(K4) = {D1, D2, D3, D4, D5, D6}
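These set operations map directly onto Python sets; a minimal sketch using the document and term names from the examples:

```python
# Posting sets taken from the example term-document matrix
doc_set = {
    "K0": {"D1", "D3", "D5"},
    "K4": {"D2", "D3", "D4", "D6"},
}

# Query: K0 AND K4  ->  set intersection
print(sorted(doc_set["K0"] & doc_set["K4"]))  # ['D3']

# Query: K0 OR K4   ->  set union
print(sorted(doc_set["K0"] | doc_set["K4"]))  # ['D1', 'D2', 'D3', 'D4', 'D5', 'D6']
```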

Boolean Query
User Boolean queries are usually simple Boolean expressions. A Boolean query can be represented in disjunctive normal form (DNF): a disjunction corresponds to OR, a conjunction to AND, and a DNF consists of a disjunction of conjunctive Boolean expressions.

DNF Form
K0 OR (NOT K3 AND K5) is in DNF. DNF query processing can be very efficient: if any one of the conjunctive expressions is true, the entire DNF will be true, which allows short-circuit evaluation. Stop matching the expression with a document as soon as one conjunctive expression matches the document, and label the document as relevant to the query.

Boolean Model Advantages
- Simplicity and efficiency of implementation
- Binary values can be stored using bits: reduced storage requirements, and retrieval using bitwise operations is efficient
- Boolean retrieval was adopted by many commercial bibliographic systems
- Boolean queries are akin to database queries

Boolean Model Disadvantages
- A document is either relevant or non-relevant to the query; it is not possible to assign a degree of relevance
- Complicated Boolean queries are difficult for users
- Boolean queries retrieve too few or too many documents: K0 AND K4 retrieved only 1 out of 6 documents, while K0 OR K4 retrieved 5 out of a possible 6 documents

The vector space model treats both the documents and queries as vectors.

Vector Space Model
A weight is assigned to each term based on its frequency in the document.

Graphical representation of the VSM model

Computing the Similarity
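The slide's similarity formula is not reproduced in this transcript; the standard VSM choice is the cosine of the angle between the query and document vectors, sim(q, d) = (q · d) / (|q| |d|). A minimal sketch (the example vectors are made up for illustration, not the slide's data):

```python
import math

def cosine_similarity(q, d):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm = (math.sqrt(sum(qi * qi for qi in q))
            * math.sqrt(sum(di * di for di in d)))
    return dot / norm if norm else 0.0

query = [1.0, 0.0, 1.0]   # query weights over three index terms
doc = [0.5, 0.25, 1.0]    # one document's normalized weights
print(round(cosine_similarity(query, doc), 4))  # 0.9258
```

Computing this score against every document and sorting in decreasing order produces a ranked list like the one on the next slide.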

Relevance Values and Ranking
Ranking based on the similarity:
D0 (0.7774), D6 (0.4953), D2 (0.3123), D1 (0.2590), D5 (0.2122), D4 (0.1727), D3 (0.1084)

Variations of the normalized frequency: inverse document frequency (idf). The idf for the jth term:

N = number of documents
nj = number of documents containing the jth term

Modified weights:

Variations of VSM

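The idf and modified-weight formulas referenced above are not reproduced in the transcript; the standard definitions (the log base, commonly 2 or 10, is an assumption here) are:

```latex
idf_j = \log \frac{N}{n_j},
\qquad
w_{ij} = tf_{ij} \times idf_j = tf_{ij} \times \log \frac{N}{n_j}
```

where tf_ij is the normalized frequency of term j in document i, N is the number of documents, and n_j is the number of documents containing term j. A term that appears in every document gets idf = 0 and thus carries no discriminating weight.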

Inverse Document Frequencies

TDM using idf

Similarity and Ranking using idf
Ranking based on the similarity:
D0 (0.7867), D6 (0.4953), D2 (0.3361), D1 (0.2590), D5 (0.2215), D4 (0.1208), D3 (0.0969)

VSM vs. Boolean
- Queries are easier to express: users can attach relative weights to terms
- A descriptive query can be transformed into a query vector similar to documents
- Matching between a query and a document is not precise: each document is allocated a degree of similarity
- Documents are ranked based on their similarity scores instead of relevant/non-relevant classes
- Users can go through the ranked list until their information needs are met

Evaluation should include:

- Functionality
- Response time
- Storage requirements
- Accuracy

Evaluation of Retrieval Performance
Early days: batch testing, with a document collection such as cacm.all and a query collection such as query.text.

Present day: interactive tests are used, but they are difficult to conduct and time-consuming, so batch testing is still important.

Accuracy Testing

Precision and Recall

Precision: how many of the retrieved documents are relevant?
Recall: how many of the relevant documents are retrieved?

Example

F-measure

The choice of three retrieved documents in the example was arbitrary.
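Precision, recall, and the F-measure (the harmonic mean of precision and recall) can be computed directly from the retrieved and relevant sets. The sets below are illustrative, not the slide's example data:

```python
def precision_recall_f1(retrieved, relevant):
    """Precision = hits / |retrieved|; recall = hits / |relevant|;
    F1 = harmonic mean of precision and recall."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

retrieved = {"D0", "D6", "D2"}       # three documents returned by the system
relevant = {"D0", "D2", "D1", "D5"}  # ground-truth relevant documents
p, r, f = precision_recall_f1(retrieved, relevant)
print(p, r, round(f, 3))  # 0.6666666666666666 0.5 0.571
```

Note the trade-off this illustrates: retrieving more documents can only raise recall, but usually lowers precision, which is why the two are reported together (or combined in the F-measure).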

Average Precision

Relationship between precision and recall