
Web Search & Information Retrieval

Web search engines
  Rooted in Information Retrieval (IR) systems
    Prepare a keyword index for the corpus
    Respond to keyword queries with a ranked list of documents.

ARCHIE
  Earliest application of rudimentary IR systems to the Internet
  Title search across sites serving files over FTP

Mining the Web, Chakrabarti and Ramakrishnan

Boolean queries: Examples

Simple queries involving relationships between terms and documents
  Documents containing the word Java
  Documents containing the word Java but not the word coffee
Proximity queries
  Documents containing the phrase Java beans or the term API
  Documents where Java and island occur in the same sentence
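A minimal sketch of the example queries, assuming a toy in-memory corpus. The documents, the `index` dictionary and the helper names are made up for illustration, and each toy document is treated as a single sentence.

```python
# Toy corpus; document IDs and text are hypothetical.
docs = {
    1: "java beans are reusable components",
    2: "java coffee is grown on the island of java",
    3: "the api exposes a search endpoint",
    4: "java island tourism",
}

# Inverted index: term -> set of IDs of documents containing it.
index = {}
for did, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(did)


def containing(term):
    """Documents containing a single term."""
    return index.get(term, set())


def containing_phrase(phrase):
    """Documents in which the words of `phrase` occur consecutively (naive scan)."""
    words = phrase.split()
    hits = set()
    for did, text in docs.items():
        tokens = text.split()
        if any(tokens[i:i + len(words)] == words for i in range(len(tokens))):
            hits.add(did)
    return hits


java = containing("java")                                        # word Java
java_not_coffee = containing("java") - containing("coffee")      # Java but not coffee
beans_or_api = containing_phrase("java beans") | containing("api")
# Each toy document is one sentence, so "same sentence" reduces to AND here.
java_and_island = containing("java") & containing("island")
```

Boolean operators map directly onto set difference, union and intersection; only the phrase query needs token order.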


Document preprocessing

Tokenization
  Filtering away tags
  Tokens regarded as nonempty sequences of characters excluding spaces and punctuation.
  Token represented by a suitable integer, tid, typically 32 bits
  Optional: stemming/conflation of words
  Result: document (did) transformed into a sequence of integers (tid, pos)
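The preprocessing steps can be sketched roughly as follows. This is an illustration under stated assumptions, not a reference tokenizer: tags are stripped with a simple regular expression, tokens are lowercased runs of letters and digits, and the `lexicon` dictionary that hands out tids is a hypothetical helper.

```python
import re


def tokenize(html, lexicon):
    """Strip tags, split into tokens, and return a list of (tid, pos) pairs."""
    text = re.sub(r"<[^>]*>", " ", html)                 # filter away tags
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())   # drop spaces and punctuation
    sequence = []
    for pos, token in enumerate(tokens):
        # Reuse the tid of a known token, otherwise assign the next free integer.
        tid = lexicon.setdefault(token, len(lexicon))
        sequence.append((tid, pos))
    return sequence


lexicon = {}
seq = tokenize("<p>My care is loss of care.</p>", lexicon)
```

The same `lexicon` must be shared across documents so that a term receives the same tid everywhere.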


Storing tokens

Straightforward implementation using a relational database
  Example figure
  Space scales to almost 10 times
Accesses to table show common pattern
  reduce the storage by mapping tids to a lexicographically sorted buffer of (did, pos) tuples.
  Indexing = transposing document-term matrix


Two variants of the inverted index data structure, usually stored on disk. The simpler version in the middle does not store term offset information; the version to the right stores term offsets. The mapping from terms to documents and positions (written as “document/position”) may be implemented using a B-tree or a hash-table.
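A small in-memory sketch of the two variants, assuming whitespace tokenization and a Python dictionary in place of the on-disk B-tree or hash table. The corpus and the function name `build_index` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical two-document corpus.
docs = {
    1: "my care is loss of care",
    2: "by old care done",
}


def build_index(docs):
    """Return term -> list of (did, pos) postings, sorted lexicographically."""
    postings = defaultdict(list)
    for did in sorted(docs):
        for pos, term in enumerate(docs[did].split()):
            postings[term].append((did, pos))
    return dict(postings)


# Variant with term offsets.
positional = build_index(docs)

# Simpler variant without offsets: term -> sorted list of document IDs.
document_level = {
    term: sorted({did for did, _ in plist}) for term, plist in positional.items()
}
```

Building the postings term by term is exactly the transposition of the document-term matrix mentioned earlier.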


Storage

For dynamic corpora
  Berkeley DB2 storage manager
  Can frequently add, modify and delete documents
For static collections
  Index compression techniques (to be discussed)


Stopwords

Function words and connectives
  Appear in a large number of documents and are of little use in pinpointing documents
Indexing stopwords
  Stopwords not indexed
    For reducing index space and improving performance
  Replace stopwords with a placeholder (to remember the offset)
Issues
  Queries containing only stopwords ruled out
  Polysemous words that are stopwords in one sense but not in others
    E.g.: can as a verb vs. can as a noun


Stemming

Conflating words to help match a query term with a morphological variant in the corpus.
Remove inflections that convey parts of speech, tense and number
  E.g.: university and universal both stem to universe.
Techniques
  morphological analysis (e.g., Porter's algorithm)
  dictionary lookup (e.g., WordNet).
Stemming may increase recall but at the price of precision
  Abbreviations, polysemy and names coined in the technical and commercial sectors
  E.g.: Stemming “ides” to “IDE”, “SOCKS” to “sock”, “gated” to “gate”, may be bad!


Batch indexing and updates

Incremental indexing
  Time-consuming due to random disk I/O
  High level of disk block fragmentation
Simple sort-merges.
  To replace the indexed update of variable-length postings
For a dynamic collection
  A single document-level change may need to update hundreds to thousands of records.
  Solution: create an additional “stop-press” index.


Maintaining indices over dynamic collections.


Stop-press index

Collection of documents in flux
  Model document modification as deletion followed by insertion
  Documents in flux represented by a signed record (d, t, s)
  “s” specifies if “d” has been deleted or inserted.
Getting the final answer to a query
  Main index returns a document set $D_0$.
  Stop-press index returns two document sets
    $D_+$ : documents not yet indexed in $D_0$ matching the query
    $D_-$ : documents matching the query removed from the collection since $D_0$ was constructed.
  Final answer: $D_0 \cup D_+ \setminus D_-$
Stop-press index getting too large
  Rebuild the main index
    signed (d, t, s) records are sorted in (t, d, s) order and merge-purged into the master (t, d) records
  Stop-press index can be emptied out.
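A minimal sketch of query answering and rebuilding with a stop-press index, assuming single-term queries and in-memory Python sets. The names `main_index`, `stop_press`, `query` and `rebuild` are hypothetical, and the sketch assumes no document is both deleted and re-inserted under the same term between rebuilds.

```python
# Main index: term -> set of document IDs (stands in for the master (t, d) records).
main_index = {"java": {1, 2, 3}, "island": {2, 4}}

# Signed records (d, t, s): s = +1 for an insertion, s = -1 for a deletion.
stop_press = [(5, "java", +1), (2, "java", -1), (2, "island", -1)]


def query(term):
    """Combine the main answer D0 with the stop-press sets D+ and D-."""
    d0 = main_index.get(term, set())
    d_plus = {d for d, t, s in stop_press if t == term and s > 0}
    d_minus = {d for d, t, s in stop_press if t == term and s < 0}
    return (d0 | d_plus) - d_minus


def rebuild():
    """Merge-purge the signed records, sorted in (t, d, s) order, into the main index."""
    for d, t, s in sorted(stop_press, key=lambda rec: (rec[1], rec[0], rec[2])):
        bucket = main_index.setdefault(t, set())
        if s > 0:
            bucket.add(d)
        else:
            bucket.discard(d)
    stop_press.clear()  # the stop-press index can now be emptied
```

Queries return the same answers before and after `rebuild()`; only the place where the information lives changes.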


Index compression techniques

Compressing the index so that much of it can be held in memory
  Required for high-performance IR installations (as with Web search engines)
Redundancy in index storage
  Storage of document IDs.
Delta encoding
  Sort Doc IDs in increasing order
  Store the first ID in full
  Subsequently store only the difference (gap) from the previous ID
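Delta encoding can be sketched in a few lines. The function names are made up for illustration and the gaps are kept as plain Python integers; a real index would then feed them to a bit-level code.

```python
def delta_encode(doc_ids):
    """Sort the IDs, keep the first in full, then store gaps to the previous ID."""
    ids = sorted(doc_ids)
    if not ids:
        return []
    return [ids[0]] + [cur - prev for prev, cur in zip(ids, ids[1:])]


def delta_decode(gaps):
    """Recover the sorted document IDs by taking running sums of the gaps."""
    ids, total = [], 0
    for gap in gaps:
        total += gap
        ids.append(total)
    return ids
```

For example, the posting list 3, 7, 8, 15, 100 becomes 3, 4, 1, 7, 85: most stored values are now small, which is what the gap codes exploit.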


Encoding gaps

Small gap must cost far fewer bits than a document ID.
Binary encoding
  Optimal when all symbols are equally likely
Unary code
  Optimal if probability of large gaps decays exponentially


Encoding gaps

Gamma code
  Represent gap $x$ as
    Unary code for $1 + \lfloor \log x \rfloor$ followed by
    $x - 2^{\lfloor \log x \rfloor}$ represented in binary ($\lfloor \log x \rfloor$ bits)
Golomb codes
  Further enhancement
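A sketch of the gamma code on bit strings, assuming logarithms to base 2 and the convention that the unary code for n is n - 1 ones followed by a terminating zero (the slides do not fix the unary convention, so this is an assumption). Function names are illustrative.

```python
def unary(n):
    """Unary code for n >= 1: (n - 1) ones followed by a terminating zero (assumed convention)."""
    return "1" * (n - 1) + "0"


def gamma_encode(x):
    """Gamma code for a gap x >= 1, returned as a string of '0'/'1' characters."""
    if x < 1:
        raise ValueError("gamma code is defined for x >= 1")
    n = x.bit_length() - 1           # floor(log2 x)
    offset = x - (1 << n)            # x - 2^floor(log2 x)
    tail = format(offset, "b").zfill(n) if n else ""
    return unary(1 + n) + tail


def gamma_decode(bits):
    """Decode a single gamma-coded gap from the start of `bits`."""
    n = bits.index("0")              # number of leading ones = floor(log2 x)
    tail = bits[n + 1:n + 1 + n]
    return (1 << n) + (int(tail, 2) if tail else 0)
```

A gap of 9 is encoded as 1110 followed by 001, so 7 bits are used instead of a full-width document ID.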


Lossy compression mechanisms

Trading off space for time
  Collect documents into buckets
    Construct inverted index from terms to bucket IDs
    Document IDs shrink to half their size.
  Cost: time overheads
    For each query, all documents in that bucket need to be scanned
    Solution: index documents in each bucket separately
    E.g.: Glimpse (http://webglimpse.org/)


General dilemmas

Messy updates vs. high compression rate
Storage allocation vs. random I/Os
Random I/O vs. large-scale implementation


Relevance ranking

Keyword queries
  In natural language
  Not precise, unlike SQL
Boolean decision for response unacceptable
  Solution
    Rate each document for how likely it is to satisfy the user's information need
    Sort in decreasing order of the score
    Present results in a ranked list.
No algorithmic way of ensuring that the ranking strategy always favors the information need
  Query: only a part of the user's information need


Responding to queries

Set-valued response
  Response set may be very large
    (E.g., by recent estimates, over 12 million Web pages contain the word java.)
Demanding selective query from user
Guessing user's information need and ranking responses
Evaluating rankings


Evaluation procedure

Given benchmark
  Corpus of $n$ documents $D$
  A set of queries $Q$
  For each query $q \in Q$, an exhaustive set of relevant documents $D_q \subseteq D$ identified manually
Query submitted to the system
  Ranked list of documents $(d_1, d_2, \ldots, d_n)$ retrieved
  Compute a 0/1 relevance list $(r_1, r_2, \ldots, r_n)$
    $r_i = 1$ iff $d_i \in D_q$
    $r_i = 0$ otherwise.


Recall and precision

Recall at rank $k \ge 1$
  Fraction of all relevant documents included in $(d_1, d_2, \ldots, d_k)$.
  $$\mathrm{recall}(k) = \frac{1}{|D_q|} \sum_{1 \le i \le k} r_i$$
Precision at rank $k$
  Fraction of the top $k$ responses that are actually relevant.
  $$\mathrm{precision}(k) = \frac{1}{k} \sum_{1 \le i \le k} r_i$$
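The two definitions translate directly into code. In this sketch the relevance list `r` and the count `num_relevant` (standing for |D_q|) are made-up example values.

```python
def recall_at(k, r, num_relevant):
    """Fraction of all relevant documents found among the top k responses."""
    return sum(r[:k]) / num_relevant


def precision_at(k, r):
    """Fraction of the top k responses that are relevant."""
    return sum(r[:k]) / k


# Hypothetical 0/1 relevance list for one query; 4 documents are relevant in
# total, one of which does not appear among the six responses shown.
r = [1, 0, 1, 0, 0, 1]
num_relevant = 4

recall_3 = recall_at(3, r, num_relevant)    # 2 of 4 relevant documents seen
precision_3 = precision_at(3, r)            # 2 of the top 3 are relevant
```

Both measures share the same numerator; they differ only in whether it is divided by the number of relevant documents or by the number of responses inspected.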


Other measures

Average precision
  Sum of precision at each relevant hit position in the response list, divided by the total number of relevant documents
  $$\mathrm{avg.precision} = \frac{1}{|D_q|} \sum_{1 \le k \le |D|} r_k \cdot \mathrm{precision}(k)$$
  avg.precision = 1 iff engine retrieves all relevant documents and ranks them ahead of any irrelevant document
Interpolated precision
  To combine precision values from multiple queries
  Gives precision-vs.-recall curve for the benchmark.
    For each query, take the maximum precision obtained for the query for any recall greater than or equal to a given recall level $\rho$
    Average them together for all queries
Others like measures of authority, prestige etc.
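A sketch of average precision and of interpolated precision for a single query. The relevance list is a made-up example, the recall level is passed as a parameter called `rho` here, and ranks are taken from 1 upward (the k = 0 convention is not included).

```python
def average_precision(r, num_relevant):
    """Sum of precision(k) over the relevant hit positions, divided by |D_q|."""
    hits, total = 0, 0.0
    for k, rk in enumerate(r, start=1):
        hits += rk
        if rk:
            total += hits / k        # precision(k) at a relevant position
    return total / num_relevant


def interpolated_precision(r, num_relevant, rho):
    """Maximum precision(k) over all ranks k >= 1 whose recall is at least rho."""
    hits, best = 0, 0.0
    for k, rk in enumerate(r, start=1):
        hits += rk
        if hits / num_relevant >= rho:
            best = max(best, hits / k)
    return best


r = [1, 0, 1, 0, 0, 1]               # hypothetical relevance list, |D_q| = 3
ap = average_precision(r, 3)         # (1/1 + 2/3 + 3/6) / 3
ip_half = interpolated_precision(r, 3, 0.5)
```

Averaging `interpolated_precision` over all benchmark queries at a fixed set of recall levels yields the precision-vs.-recall curve.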


Precision-Recall tradeoff

Interpolated precision cannot increase with recall
  Interpolated precision at recall level 0 may be less than 1
At level k = 0
  Precision (by convention) = 1, Recall = 0
Inspecting more documents
  Can increase recall
  Precision may decrease
    We will start encountering more and more irrelevant documents
Search engine with a good ranking function will generally show a negative relation between recall and precision.
  Higher the curve, better the engine


Precision and interpolated precision plotted against recall for the given relevance vector. Missing $r_k$ are zeroes.


The vector space model

Documents represented as vectors in a multi-dimensional Euclidean space
  Each axis = a term (token)
Coordinate of document d in direction of term t determined by:
  Term frequency TF(d,t)
    Number of times term t occurs in document d, scaled in a variety of ways to normalize document length
  Inverse document frequency IDF(t)
    To scale down the coordinates of terms that occur in many documents


Term frequency

$$\mathrm{TF}(d,t) = \frac{n(d,t)}{\sum_{\tau} n(d,\tau)}$$

$$\mathrm{TF}(d,t) = \frac{n(d,t)}{\max_{\tau} n(d,\tau)}$$

Cornell SMART system uses a smoothed version

$$\mathrm{TF}(d,t) = \begin{cases} 0 & \text{if } n(d,t) = 0 \\ 1 + \log(1 + n(d,t)) & \text{otherwise} \end{cases}$$


Inverse document frequency

Given
  $D$ is the document collection and $D_t$ is the set of documents containing $t$
Formulae
  Mostly dampened functions of $|D| / |D_t|$
  SMART
    $$\mathrm{IDF}(t) = \log\left(\frac{1 + |D|}{|D_t|}\right)$$


Vector space model

Coordinate of document d in axis t: $d_t = \mathrm{TF}(d,t)\,\mathrm{IDF}(t)$.
  Transformed to $\vec{d}$ in the TFIDF-space
Query q
  Interpreted as a document
  Transformed to $\vec{q}$ in the same TFIDF-space as d
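A sketch of mapping documents into TFIDF-space with the SMART-style TF and IDF. It assumes natural logarithms, whitespace tokenization and sparse vectors stored as dictionaries; the corpus and function names are made up for illustration.

```python
import math
from collections import Counter


def tf(n):
    """Smoothed term frequency: 0 if the term is absent, else 1 + log(1 + n)."""
    return 0.0 if n == 0 else 1.0 + math.log(1.0 + n)


def idf(num_docs, doc_freq):
    """Inverse document frequency: log((1 + |D|) / |D_t|)."""
    return math.log((1.0 + num_docs) / doc_freq)


def tfidf_vectors(docs):
    """Map each document to a sparse vector {term: TF(d,t) * IDF(t)}."""
    counts = {did: Counter(text.split()) for did, text in docs.items()}
    doc_freq = Counter(term for c in counts.values() for term in c)
    num_docs = len(docs)
    return {
        did: {term: tf(n) * idf(num_docs, doc_freq[term]) for term, n in c.items()}
        for did, c in counts.items()
    }


# Hypothetical two-document corpus.
vectors = tfidf_vectors({1: "java island java", 2: "java coffee"})
```

A query would be pushed through the same `tf` and `idf` functions, using the collection's document frequencies, so that it lands in the same space as the documents.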


Measures of proximity

Distance measure
  Magnitude of the vector difference $|\vec{d} - \vec{q}|$.
  Document vectors must be normalized to unit ($L_1$ or $L_2$) length
    Else shorter documents dominate (since queries are short)
Cosine similarity
  Cosine of the angle between $\vec{d}$ and $\vec{q}$
  Shorter documents are penalized
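Both proximity measures can be sketched on sparse vectors stored as dictionaries. The helper names and the example weights are made up; zero vectors are handled by returning a similarity of 0, which is a choice of this sketch.

```python
import math


def l2_norm(v):
    """Euclidean (L2) length of a sparse vector {term: weight}."""
    return math.sqrt(sum(w * w for w in v.values()))


def distance(d, q):
    """Magnitude of the vector difference |d - q|."""
    terms = set(d) | set(q)
    return math.sqrt(sum((d.get(t, 0.0) - q.get(t, 0.0)) ** 2 for t in terms))


def cosine(d, q):
    """Cosine of the angle between d and q (0.0 if either vector is all zeros)."""
    nd, nq = l2_norm(d), l2_norm(q)
    if nd == 0.0 or nq == 0.0:
        return 0.0
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    return dot / (nd * nq)


d = {"java": 3.0, "island": 4.0}     # hypothetical TFIDF weights
q = {"java": 1.0}
similarity = cosine(d, q)            # 3 / (5 * 1)
```

Dividing by both lengths inside `cosine` has the same effect as normalizing the vectors to unit L2 length beforehand.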


Relevance feedback

Users learning how to modify queries
  Response list must have at least some relevant documents
Relevance feedback
  `Correcting' the ranks to the user's taste
  Automates the query refinement process
Rocchio's method
  Folding-in user feedback
  To query vector $\vec{q}$
    Add a weighted sum of vectors for relevant documents $D_+$
    Subtract a weighted sum of the irrelevant documents $D_-$
  $$\vec{q}\,' = \alpha \vec{q} + \beta \sum_{\vec{d} \in D_+} \vec{d} - \gamma \sum_{\vec{d} \in D_-} \vec{d}$$
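A sketch of Rocchio's update on sparse vectors. The three weights are written as `alpha`, `beta` and `gamma` and must be supplied by the caller; no particular values are implied by the slides, and the example vectors are made up.

```python
def rocchio(q, relevant, irrelevant, alpha, beta, gamma):
    """Return alpha*q + beta*sum(relevant) - gamma*sum(irrelevant) as {term: weight}."""
    new_q = {t: alpha * w for t, w in q.items()}
    for d in relevant:                       # documents in D+
        for t, w in d.items():
            new_q[t] = new_q.get(t, 0.0) + beta * w
    for d in irrelevant:                     # documents in D-
        for t, w in d.items():
            new_q[t] = new_q.get(t, 0.0) - gamma * w
    return new_q


# Hypothetical query and feedback documents.
q = {"java": 1.0}
d_plus = [{"java": 1.0, "island": 2.0}]
d_minus = [{"coffee": 4.0}]
q_new = rocchio(q, d_plus, d_minus, alpha=1.0, beta=0.5, gamma=0.25)
```

Passing `gamma=0` gives the variant in which irrelevant documents are ignored altogether.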


Relevance feedback (contd.)

Pseudo-relevance feedback
  $D_+$ and $D_-$ generated automatically
    E.g.: Cornell SMART system
    Top 10 documents reported by the first round of query execution are included in $D_+$
    $\gamma$ typically set to 0; $D_-$ not used
Not a commonly available feature
  Web users want instant gratification
  System complexity
    Executing the second-round query is slower and expensive for major search engines


Meta-search systems

Take the search engine to the document
  Forward queries to many geographically distributed repositories
    Each has its own search service
  Consolidate their responses.
Advantages
  Perform non-trivial query rewriting
    Suit a single user query to many search engines with different query syntax
  Surprisingly small overlap between crawls
Consolidating responses
  Function goes beyond just eliminating duplicates
  Search services do not provide standard ranks which can be combined meaningfully


Similarity search

Cluster hypothesis
  Documents similar to relevant documents are also likely to be relevant
Handling “find similar” queries
  Replication or duplication of pages
  Mirroring of sites


Document similarity

Jaccard coefficient of similarity between document $d_1$ and $d_2$
  T(d) = set of tokens in document d
  $$r'(d_1, d_2) = \frac{|T(d_1) \cap T(d_2)|}{|T(d_1) \cup T(d_2)|}$$
  Symmetric, reflexive, not a metric
  Forgives any number of occurrences and any permutations of the terms.
$1 - r'(d_1, d_2)$ is a metric
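The Jaccard coefficient is a one-liner on token sets. This sketch assumes lowercased whitespace tokenization, and treats two empty documents as identical, which is a convention chosen here rather than something the slides specify.

```python
def token_set(text):
    """T(d): the set of tokens in a document (lowercased, split on whitespace)."""
    return set(text.lower().split())


def jaccard(d1, d2):
    """|T(d1) & T(d2)| / |T(d1) | T(d2)| for two document strings."""
    t1, t2 = token_set(d1), token_set(d2)
    if not t1 and not t2:
        return 1.0                   # assumed convention for two empty documents
    return len(t1 & t2) / len(t1 | t2)


similarity = jaccard("java island coffee", "java coffee beans")   # 2 shared of 4
```

Because only sets are compared, repeating or reordering terms leaves the coefficient unchanged, e.g. `jaccard("java java island", "island java")` is 1.0.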


Estimating Jaccard coefficient with random permutations

1. Generate a set of $m$ random permutations $\pi$
2. for each $\pi$ do
3.   compute $\pi(d_1)$ and $\pi(d_2)$
4.   check if $\min T(\pi(d_1)) = \min T(\pi(d_2))$
5. end for
6. if equality was observed in $k$ cases, estimate $r'(d_1, d_2) = k/m$.
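A sketch of the estimation procedure on two nonempty token sets. It assumes that shuffling only the union of the two sets is an acceptable stand-in for permuting the whole term universe (terms outside the union cannot affect either minimum); the function name, the default `m` and the fixed `seed` are illustrative choices.

```python
import random


def estimate_jaccard(t1, t2, m=200, seed=0):
    """Estimate the Jaccard coefficient of two nonempty token sets as k / m."""
    rng = random.Random(seed)
    universe = sorted(t1 | t2)
    k = 0
    for _ in range(m):
        # One random permutation: each term receives a random rank.
        order = universe[:]
        rng.shuffle(order)
        rank = {term: i for i, term in enumerate(order)}
        # Equality holds exactly when the lowest-ranked term lies in both sets.
        if min(rank[t] for t in t1) == min(rank[t] for t in t2):
            k += 1
    return k / m


t1 = {"java", "island", "coffee"}
t2 = {"java", "coffee", "beans"}
estimate = estimate_jaccard(t1, t2, m=2000)   # true coefficient is 0.5
```

The estimate is unbiased, and its spread shrinks as `m` grows, so larger `m` trades time for accuracy.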
