
Information Retrieval and Web Search

Adapted from slides by Bin Liu @ UIC and by Christopher Manning and Prabhakar Raghavan @ Stanford

Introduction

• Text mining refers to data mining using text documents as data.
• Most text mining tasks use Information Retrieval (IR) methods to pre-process text documents.
• These methods are quite different from traditional data pre-processing methods used for relational tables.
• Web search also has its roots in IR.

Information Retrieval (IR)

• Conceptually, IR is the study of finding needed information, i.e., IR helps users find information that matches their information needs.
  – Expressed as queries
• Historically, IR is about document retrieval, emphasizing the document as the basic unit.
  – Finding documents relevant to user queries
• Technically, IR studies the acquisition, organization, storage, retrieval, and distribution of information.

IR architecture

[Figure: IR architecture]

IR queries

• Keyword queries
• Boolean queries (using AND, OR, NOT)
• Phrase queries
• Proximity queries
• Full document queries
• Natural language questions

Information retrieval models

• An IR model governs how a document and a query are represented and how the relevance of a document to a user query is defined.
• Main models:
  – Boolean model
  – Vector space model
  – Statistical language model
  – etc.

Boolean model

• Each document or query is treated as a “bag” of words or terms. Word sequence is not considered.
• Given a collection of documents $D$, let $V = \{t_1, t_2, \ldots, t_{|V|}\}$ be the set of distinct words/terms in the collection. $V$ is called the vocabulary.
• A weight $w_{ij} > 0$ is associated with each term $t_i$ that appears in a document $d_j \in D$. For a term that does not appear in document $d_j$, $w_{ij} = 0$. In the Boolean model the weights are simply 0 or 1. Each document is thus a vector:

  $$d_j = (w_{1j}, w_{2j}, \ldots, w_{|V|j})$$

Boolean model (contd)

• Query terms are combined logically using the Boolean operators AND, OR, and NOT.
  – E.g., ((data AND mining) AND (NOT text))
• Retrieval
  – Given a Boolean query, the system retrieves every document that makes the query logically true.
  – Called exact match.
• The retrieval results are usually quite poor because term frequency is not considered.
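As a rough illustration of exact match, the sketch below evaluates the example query ((data AND mining) AND (NOT text)) over a toy collection. The three documents and the helper `matches` are hypothetical, made up only for this example; each document is reduced to a set of words, so counts and word order play no role.

```python
# Toy collection (hypothetical documents, for illustration only).
docs = {
    1: "data mining finds patterns in data",
    2: "text mining is data mining on text",
    3: "mining equipment for sale",
}

# Treat each document as a set of words (word order and counts are ignored).
doc_terms = {doc_id: set(text.split()) for doc_id, text in docs.items()}


def matches(terms):
    """Evaluate ((data AND mining) AND (NOT text)) against one document."""
    return ("data" in terms and "mining" in terms) and ("text" not in terms)


# Exact match: a document is either returned or not; there is no ranking.
result = sorted(doc_id for doc_id, terms in doc_terms.items() if matches(terms))
print(result)
```

Document 1 repeats "data" twice, but that has no effect on the outcome, which is the weakness the last bullet points at.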


Boolean queries: Exact match

• The Boolean retrieval model lets the user ask any query that is a Boolean expression:
  – Boolean queries use AND, OR and NOT to join query terms
    • Views each document as a set of words
    • Is precise: a document either matches the condition or it does not.
  – Perhaps the simplest model to build an IR system on
• Primary commercial retrieval tool for 3 decades.
• Many search systems you still use are Boolean:
  – Email, library catalog, Mac OS X Spotlight

Sec. 1.3

Example: WestLaw http://www.westlaw.com/

• Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992)
• Tens of terabytes of data; 700,000 users
• Majority of users still use Boolean queries
• Example query:
  – What is the statute of limitations in cases involving the federal tort claims act?
  – LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
    • /3 = within 3 words, /S = in same sentence

Sec. 1.4

Example: WestLaw http://www.westlaw.com/

• Another example query:
  – Requirements for disabled people to be able to access a workplace
  – disabl! /p access! /s work-site work-place (employment /3 place)
• Note that SPACE is disjunction, not conjunction!
• Long, precise queries; proximity operators; incrementally developed; not like web search
• Many professional searchers still like Boolean search
  – You know exactly what you are getting
• But that doesn’t mean it actually works better…

Sec. 1.4

Vector space model

• Documents are also treated as a “bag” of words or terms.
• Each document is represented as a vector.
• However, the term weights are no longer 0 or 1. Each term weight is computed based on some variation of the TF or TF-IDF scheme.
• Term Frequency (TF) scheme: the weight of a term $t_i$ in document $d_j$ is the number of times that $t_i$ appears in $d_j$, denoted by $f_{ij}$. Normalization may also be applied.

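TF-IDF term weighting scheme

• The most well-known weighting scheme
  – TF: still term frequency, $tf_{ij}$ (the count $f_{ij}$, possibly normalized)
  – IDF: inverse document frequency.
    $N$: total number of docs
    $df_i$: the number of docs in which $t_i$ appears.

    $$idf_i = \log \frac{N}{df_i}$$

• The final TF-IDF term weight is:

  $$w_{ij} = tf_{ij} \times idf_i$$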
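A minimal sketch of TF-IDF weighting, using $N$ and $df_i$ as defined on this slide. Two choices here are assumptions rather than something the slide fixes: term frequency is normalized by the largest count in the document, and the logarithm is natural. The three token lists are hypothetical.

```python
import math
from collections import Counter

# Hypothetical, already tokenized documents.
docs = [
    ["data", "mining", "data"],
    ["text", "mining"],
    ["web", "search", "text", "text"],
]

N = len(docs)
# df[t]: number of documents in which term t appears.
df = Counter(term for doc in docs for term in set(doc))


def tf_idf(doc):
    """Return {term: weight} for one document."""
    counts = Counter(doc)          # f_ij: raw term counts
    max_f = max(counts.values())   # assumed normalizer: largest count in the doc
    return {t: (f / max_f) * math.log(N / df[t]) for t, f in counts.items()}


weights = [tf_idf(doc) for doc in docs]
print(weights[0])
```

In the first document "data" outweighs "mining" both because it occurs more often there and because it occurs in fewer documents overall.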


Retrieval in vector space model

• Query $q$ is represented in the same way or slightly differently.
• Relevance of $d_i$ to $q$: compare the similarity of query $q$ and document $d_i$.
• Cosine similarity (the cosine of the angle between the two vectors):

  $$\mathrm{cosine}(d_i, q) = \frac{d_i \cdot q}{\|d_i\| \, \|q\|}$$

• Cosine is also commonly used in text clustering

An Example

• A document space is defined by three terms:
  – hardware, software, users
  – the vocabulary
• A set of documents is defined as:
  – A1=(1, 0, 0), A2=(0, 1, 0), A3=(0, 0, 1)
  – A4=(1, 1, 0), A5=(1, 0, 1), A6=(0, 1, 1)
  – A7=(1, 1, 1), A8=(1, 0, 1), A9=(0, 1, 1)
• If the query is “hardware and software”, what documents should be retrieved?

An Example (cont.)

• In Boolean query matching:
  – documents A4, A7 will be retrieved (“AND”)
  – retrieved: A1, A2, A4, A5, A6, A7, A8, A9 (“OR”)
• In similarity matching (cosine):
  – q=(1, 1, 0)
  – S(q, A1)=0.71, S(q, A2)=0.71, S(q, A3)=0
  – S(q, A4)=1, S(q, A5)=0.5, S(q, A6)=0.5
  – S(q, A7)=0.82, S(q, A8)=0.5, S(q, A9)=0.5
  – Document retrieved set (with ranking) = {A4, A7, A1, A2, A5, A6, A8, A9}
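The similarity scores in this example can be checked with a short sketch. The vectors and the query come from the slide; the function name `cosine`, the rounding to two decimals, and breaking ties by document order are choices made here for illustration.

```python
import math

# Document vectors over the vocabulary (hardware, software, users).
docs = {
    "A1": (1, 0, 0), "A2": (0, 1, 0), "A3": (0, 0, 1),
    "A4": (1, 1, 0), "A5": (1, 0, 1), "A6": (0, 1, 1),
    "A7": (1, 1, 1), "A8": (1, 0, 1), "A9": (0, 1, 1),
}
q = (1, 1, 0)  # query "hardware and software"


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


scores = {name: round(cosine(q, vec), 2) for name, vec in docs.items()}
# Rank by descending score and drop documents with zero similarity.
# sorted() is stable, so tied documents keep their A1..A9 order.
ranking = [n for n in sorted(scores, key=lambda n: -scores[n]) if scores[n] > 0]
print(scores)
print(ranking)
```

Unlike the Boolean "OR" result, the cosine ranking places A4 and A7, which contain both query terms, ahead of documents that contain only one.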


Okapi relevance method

• Another way to assess the degree of relevance is to directly compute a relevance score for each document with respect to the query.
• The Okapi method and its variations are popular techniques in this setting.

Relevance feedback

• Relevance feedback is one of the techniques for improving retrieval effectiveness. The steps:
  – The user first identifies some relevant ($D_r$) and irrelevant ($D_{ir}$) documents in the initial list of retrieved documents.
  – The system expands the query $q$ by extracting some additional terms from the sample relevant and irrelevant documents to produce $q_e$.
  – Perform a second round of retrieval.
• Rocchio method ($\alpha$, $\beta$ and $\gamma$ are parameters):

  $$q_e = \alpha q + \frac{\beta}{|D_r|} \sum_{d_r \in D_r} d_r - \frac{\gamma}{|D_{ir}|} \sum_{d_{ir} \in D_{ir}} d_{ir}$$
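A minimal sketch of Rocchio query expansion in its commonly stated form: the original query is moved toward the centroid of the relevant documents and away from the centroid of the irrelevant ones. The function name `rocchio`, the sample vectors and the parameter values are all hypothetical; the slide does not prescribe values for α, β and γ.

```python
def rocchio(q, relevant, irrelevant, alpha, beta, gamma):
    """Return the expanded query vector q_e as a list of floats."""
    dims = len(q)

    def centroid(vectors):
        # Mean vector of a (possibly empty) list of document vectors.
        if not vectors:
            return [0.0] * dims
        return [sum(v[k] for v in vectors) / len(vectors) for k in range(dims)]

    c_rel = centroid(relevant)
    c_irr = centroid(irrelevant)
    return [alpha * q[k] + beta * c_rel[k] - gamma * c_irr[k] for k in range(dims)]


# Hypothetical query and user feedback over a three-term vocabulary.
q = [1.0, 1.0, 0.0]
relevant = [[1.0, 1.0, 1.0], [1.0, 1.0, 0.0]]    # D_r
irrelevant = [[0.0, 0.0, 1.0]]                   # D_ir
q_e = rocchio(q, relevant, irrelevant, alpha=1.0, beta=0.5, gamma=0.5)
print(q_e)
```

Negative components such as the last one here are often clipped to zero in practice; this sketch leaves them as computed.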


Rocchio text classifier

• In fact, a variation of the Rocchio method above, called the Rocchio classification method, can be used to improve retrieval effectiveness too
  – so can other machine learning methods. Why?
• The Rocchio classifier is constructed by producing a prototype vector $c_i$ for each class $i$ (relevant or irrelevant in this case).
• In classification, cosine is used.

Text pre-processing

• Word (term) extraction: easy
• Stopwords removal
• Stemming
• Frequency counts and computing TF-IDF term weights.

Stopwords removal

• Many of the most frequently used words in English are useless in IR and text mining – these words are called stop words.
  – the, of, and, to, ….
  – Typically about 400 to 500 such words
  – For an application, an additional domain-specific stopword list may be constructed
• Why do we need to remove stopwords?
  – Reduce indexing (or data) file size
    • stopwords account for 20-30% of total word counts.
  – Improve efficiency and effectiveness
    • stopwords are not useful for searching or text mining
    • they may also confuse the retrieval system.
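Stopword removal amounts to filtering tokens against a fixed list. In the sketch below the seven-word list is a tiny illustrative stand-in for a real list of several hundred words, and splitting on whitespace is a simplifying assumption about tokenization.

```python
# A tiny illustrative stopword list; real lists hold roughly 400 to 500 words.
STOPWORDS = {"the", "of", "and", "to", "a", "in", "is"}


def remove_stopwords(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOPWORDS]


kept = remove_stopwords("The study of information retrieval and the Web")
print(kept)
```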


Stemming

• Techniques used to find the root/stem of a word. E.g.,
  – user, users, used, using → stem: use
  – engineering, engineered, engineer → stem: engineer

Usefulness:
• improving effectiveness of IR and text mining
  – matching similar words
  – Mainly improves recall
• reducing indexing size
  – combining words with the same roots may reduce indexing size by as much as 40-50%.

Basic stemming methods

Using a set of rules. E.g.,
• remove ending
  – if a word ends with a consonant other than s, followed by an s, then delete s.
  – if a word ends in es, drop the s.
  – if a word ends in ing, delete the ing unless the remaining word consists only of one letter or of th.
  – if a word ends with ed, preceded by a consonant, delete the ed unless this leaves only a single letter.
  – …...
• transform words
  – if a word ends with “ies” but not “eies” or “aies” then “ies” --> “y”.
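The rules listed above can be sketched as a small function. The order in which the rules are tried, treating "y" as a consonant, and stopping after the first rule that fires are assumptions of this sketch; the slide does not specify them, and this is not a full stemmer such as Porter's.

```python
VOWELS = set("aeiou")


def stem(word):
    """Apply the first matching suffix rule from the slide; otherwise return the word."""
    w = word.lower()
    # Transform rule: "ies" -> "y", but not for "eies" or "aies".
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    # Ends in "es": drop only the final "s".
    if w.endswith("es"):
        return w[:-1]
    # A consonant other than "s", followed by "s": delete the "s".
    if len(w) >= 2 and w.endswith("s") and w[-2] not in VOWELS and w[-2] != "s":
        return w[:-1]
    # Ends in "ing": delete it unless only one letter or "th" would remain.
    if w.endswith("ing"):
        rest = w[:-3]
        return rest if len(rest) > 1 and rest != "th" else w
    # Ends in "ed" after a consonant: delete it unless one letter would remain.
    if w.endswith("ed") and len(w) >= 3 and w[-3] not in VOWELS:
        rest = w[:-2]
        return rest if len(rest) > 1 else w
    return w


for word in ["users", "engineering", "engineered", "thing", "queries"]:
    print(word, "->", stem(word))
```

Rules this simple over- and under-stem in many cases, which is why practical systems use larger, carefully ordered rule sets.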


Frequency counts + TF-IDF

• Count the number of times a word occurs in a document.
  – Occurrence frequencies indicate the relative importance of a word in a document.
    • If a word appears often in a document, the document likely “deals with” subjects related to the word.
• Count the number of documents in the collection that contain each word.
• TF-IDF can then be computed.

Evaluation: Precision and Recall

• Given a query:
  – Are all retrieved documents relevant?
  – Have all the relevant documents been retrieved?
• Measures for system performance:
  – The first question is about the precision of the search
  – The second is about the completeness (recall) of the search.
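In set terms, precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The sketch below computes both, together with the F-score taken as their harmonic mean; the document IDs are hypothetical.

```python
def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, F-score) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)          # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # F-score: harmonic mean of precision and recall (0 when there are no hits).
    f = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f


# Hypothetical result: 4 documents retrieved, 5 relevant overall, 2 in common.
p, r, f = precision_recall_f(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6, 8, 10])
print(p, r, f)
```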


Precision-recall curve

[Figure: precision-recall curve]

Compare different retrieval algorithms

[Figure: precision-recall curves of different retrieval algorithms]

Compare with multiple queries

• Compute the average precision at each recall level.
• Draw precision-recall curves
• Do not forget the F-score evaluation measure.

Rank precision

• Compute the precision values at some selected rank positions.
• Mainly used in Web search evaluation.
• For a Web search engine, we can compute precisions for the top 5, 10, 15, 20, 25 and 30 returned pages – as the user seldom looks at more than 30 pages.
• Recall is not very meaningful in Web search.
  – Why?
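Rank precision can be sketched as precision over the top k results only. The ranked list and the relevance judgments below are hypothetical. Note that this computation never needs the total number of relevant pages on the Web, which is what recall would require.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked results that are relevant."""
    top_k = ranked[:k]
    return sum(1 for d in top_k if d in relevant) / k


# Hypothetical ranked result list and relevance judgments.
ranked = ["d3", "d7", "d1", "d9", "d4", "d8", "d2", "d5", "d6", "d0"]
relevant = {"d3", "d1", "d4", "d2", "d6"}

p5 = precision_at_k(ranked, relevant, 5)
p10 = precision_at_k(ranked, relevant, 10)
print(p5, p10)
```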


Web Search as a huge IR system

• A Web crawler (robot) crawls the Web to collect all the pages.

• Servers establish a huge inverted indexing database and other indexing databases

• At query (search) time, search engines conduct different types of vector query matching.


Inverted index

• The inverted index of a document collection is basically a data structure that
  – attaches to each distinct term a list of all documents that contain the term.
• Thus, in retrieval, it takes constant time to
  – find the documents that contain a query term.
  – multiple query terms are also easy to handle, as we will see soon.
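A minimal in-memory sketch of this data structure, assuming a hash map from each term to a sorted list of document IDs. The three documents are hypothetical. With a hash map, the lookup for a single term takes constant time on average.

```python
from collections import defaultdict

# Hypothetical collection: docID -> text.
docs = {
    1: "web search and information retrieval",
    2: "text mining uses information retrieval",
    3: "web mining",
}

# term -> list of docIDs that contain the term.
index = defaultdict(list)
for doc_id in sorted(docs):                       # visit documents in docID order...
    for term in sorted(set(docs[doc_id].split())):
        index[term].append(doc_id)                # ...so each list stays sorted

print(index["information"])
print(index.get("absent", []))   # .get avoids creating an entry for unknown terms
```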

An example

[Figure: example inverted index]

Index construction

• Easy! See the example,

[Figure: index construction example]

Search using inverted index

Given a query q, search has the following steps:
• Step 1 (vocabulary search): find each term/word in q in the inverted index.
• Step 2 (results merging): merge results to find documents that contain all or some of the words/terms in q.
• Step 3 (rank score computation): rank the resulting documents/pages, using
  – content-based ranking
  – link-based ranking

Inverted index: Details

• For each term t, we must store a list of all documents that contain t.
  – Identify each by a docID, a document serial number
• Can we use fixed-size arrays for this?

Brutus    → 1 2 4 11 31 45 173 174
Caesar    → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101

What happens if the word Caesar is added to document 14?

Sec. 1.2

Inverted index

• We need variable-size postings lists
  – On disk, a continuous run of postings is normal and best
  – In memory, can use linked lists or variable-length arrays
    • Some tradeoffs in size/ease of insertion

Dictionary → Postings (each docID in a list is a posting)

Brutus    → 1 2 4 11 31 45 173 174
Caesar    → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101

Sorted by docID (more later on why).

Sec. 1.2
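To make the question about adding Caesar to document 14 concrete, the sketch below keeps a postings list as a variable-length Python list and inserts a new docID at its sorted position. The helper name `add_posting` is hypothetical. The list itself is the Caesar postings list from the slide.

```python
import bisect

# Caesar's postings list, sorted by docID.
caesar = [1, 2, 4, 5, 6, 16, 57, 132]


def add_posting(postings, doc_id):
    """Insert doc_id into a sorted postings list, skipping duplicates."""
    pos = bisect.bisect_left(postings, doc_id)   # binary search for the slot
    if pos == len(postings) or postings[pos] != doc_id:
        postings.insert(pos, doc_id)             # shifts later entries right


add_posting(caesar, 14)
add_posting(caesar, 14)   # second call is a no-op
print(caesar)
```

A fixed-size array would have to be reallocated once full, and the array insert still shifts entries. That is the size versus ease-of-insertion tradeoff noted above, where a linked list trades extra pointer space for cheaper insertion.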

Inverted index construction

Documents to be indexed: Friends, Romans, countrymen.
  ↓ Tokenizer
Token stream: Friends Romans Countrymen
  ↓ Linguistic modules
Modified tokens: friend roman countryman
  ↓ Indexer
Inverted index:
  friend     → 2 4
  roman      → 1 2
  countryman → 13 16

Sec. 1.2

Indexer steps: Token sequence

• Sequence of (Modified token, Document ID) pairs.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Sec. 1.2

Indexer steps: Sort

• Sort by terms
  – And then docID

Core indexing step

Sec. 1.2

Indexer steps: Dictionary & Postings

• Multiple term entries in a single document are merged.

• Split into Dictionary and Postings

• Doc. frequency information is added.

Why frequency?
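The three indexer steps (token sequence, sort, dictionary and postings) can be sketched on the two example documents. Lowercasing with a simple regular expression stands in for the tokenizer and linguistic modules; that simplification, and the variable names, are assumptions of this sketch.

```python
import re
from itertools import groupby

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

# Step 1: sequence of (modified token, docID) pairs.
pairs = [(tok, doc_id)
         for doc_id, text in docs.items()
         for tok in re.findall(r"[a-z']+", text.lower())]

# Step 2: sort by term, then by docID.
pairs.sort()

# Step 3: merge duplicate entries, then split into dictionary and postings.
dictionary, postings = {}, {}
for term, group in groupby(pairs, key=lambda p: p[0]):
    doc_ids = sorted({doc_id for _, doc_id in group})
    postings[term] = doc_ids
    dictionary[term] = len(doc_ids)   # document frequency

print(dictionary["caesar"], postings["caesar"])
print(dictionary["killed"], postings["killed"])
```

"killed" occurs twice in Doc 1 but yields a single posting, and the stored document frequency is simply the length of each postings list.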

Sec. 1.2

Where do we pay in storage?

• Terms and counts
• Pointers
• Lists of docIDs

How do we index efficiently?
How much storage do we need?

Sec. 1.2

Query processing: AND

• Consider processing the query: Brutus AND Caesar
  – Locate Brutus in the Dictionary;
    • Retrieve its postings.
  – Locate Caesar in the Dictionary;
    • Retrieve its postings.
  – “Merge” the two postings:

Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34

Sec. 1.3

The merge

• Walk through the two postings simultaneously, in time linear in the total number of postings entries

Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
Result → 2 8

If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings sorted by docID.

Sec. 1.3

Intersecting two postings lists (a “merge” algorithm)
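A sketch of the linear-time intersection of two postings lists sorted by docID, written with list indices instead of linked-list pointers. The function name `intersect` is a choice made here. The two lists are the Brutus and Caesar postings from the preceding slides.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(x + y) steps."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:        # docID in both lists: keep it, advance both
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:       # advance whichever list has the smaller docID
            i += 1
        else:
            j += 1
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
result = intersect(brutus, caesar)
print(result)
```

Each loop iteration advances at least one index, which gives the O(x+y) bound. The walk is only correct because both lists are sorted by docID.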


Query optimization

• What is the best order for query processing?
• Consider a query that is an AND of n terms.
• For each of the n terms, get its postings, then AND them together.

Brutus    → 2 4 8 16 32 64 128
Caesar    → 1 2 3 5 8 16 21 34
Calpurnia → 13 16

Query: Brutus AND Calpurnia AND Caesar

Sec. 1.3

Query optimization example

• Process in order of increasing freq:
  – start with smallest set, then keep cutting further.
  – This is why we kept document freq. in dictionary

Brutus    → 2 4 8 16 32 64 128
Caesar    → 1 2 3 5 8 16 21 34
Calpurnia → 13 16

Execute the query as (Calpurnia AND Brutus) AND Caesar.

Sec. 1.3
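A sketch of this ordering heuristic for a conjunctive query: sort the terms by document frequency (taken here as the postings-list length) and intersect from the rarest term outward. The function names are hypothetical. The postings are those shown for Brutus, Caesar and Calpurnia.

```python
postings = {
    "Brutus": [2, 4, 8, 16, 32, 64, 128],
    "Caesar": [1, 2, 3, 5, 8, 16, 21, 34],
    "Calpurnia": [13, 16],
}


def intersect(p1, p2):
    """Linear merge of two postings lists sorted by docID."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def and_query(terms):
    """AND the terms together, rarest term first."""
    ordered = sorted(terms, key=lambda t: len(postings[t]))  # increasing doc freq
    result = postings[ordered[0]]
    for term in ordered[1:]:
        result = intersect(result, postings[term])           # never grows
    return ordered, result


ordered, result = and_query(["Brutus", "Calpurnia", "Caesar"])
print(ordered, result)
```

Starting from Calpurnia means every intermediate result has at most two entries, so the later merges are bounded by the longer list alone.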

Boolean queries: More general merges

• Exercise: Adapt the merge for the queries:
  – Brutus AND NOT Caesar
  – Brutus OR NOT Caesar
• Can we still run through the merge in time O(x+y)? What can we achieve?

Sec. 1.3

Merging

What about an arbitrary Boolean formula?
(Brutus OR Caesar) AND NOT (Antony OR Cleopatra)
• Can we always merge in “linear” time?
  – Linear in what?
• Can we do better?

Sec. 1.3

More general optimization

• e.g., (madding OR crowd) AND (ignoble OR strife)
• Get doc. freq.’s for all terms.
• Estimate the size of each OR by the sum of its doc. freq.’s (conservative).
• Process in increasing order of OR sizes.
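The estimate can be sketched as follows for the example query. The document frequencies are invented purely for illustration (the slide gives none), and the names `doc_freq`, `estimated_size` and `plan` are hypothetical.

```python
# Hypothetical document frequencies, invented for illustration only.
doc_freq = {"madding": 10, "crowd": 500, "ignoble": 5, "strife": 100}

# The query as a list of OR-groups that are ANDed together.
query = [("madding", "crowd"), ("ignoble", "strife")]


def estimated_size(or_group):
    """Upper bound on the size of the OR: the sum of its document frequencies."""
    return sum(doc_freq[t] for t in or_group)


# Process the OR-groups in increasing order of estimated size.
plan = sorted(query, key=estimated_size)
print([(group, estimated_size(group)) for group in plan])
```

The sum is conservative because a document containing both terms of a group is counted twice, so the true union can only be smaller.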


Sec. 1.3

Exercise

• Recommend a query processing order for
  (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)

Term           Freq
eyes           213312
kaleidoscope    87009
marmalade      107913
skies          271658
tangerine       46653
trees          316812

Different search engines

• The real differences among different search engines are
  – their index weighting schemes
    • Including location of terms, e.g., title, body, emphasized words, etc.
  – their query processing methods (e.g., query classification, expansion, etc.)
  – their ranking algorithms
  – Few of these are published by any of the search engine companies. They are tightly guarded secrets.

Summary

• We only give a VERY brief introduction to IR. There are a large number of other topics, e.g.,
  – Statistical language model
  – Latent semantic indexing (LSI and SVD).
  – (read an IR book or take an IR course)
• Many other interesting topics are not covered, e.g.,
  – Web search
    • Index compression
    • Ranking: combining contents and hyperlinks
  – Web page pre-processing
  – Combining multiple rankings and meta search
  – Web spamming
• Want to know more? Read the textbook
