
Lecture 12

IR in Google Age

Traditional IR

• Traditional IR examples
  – Searching a university library
  – Finding an article in a journal archive
  – Searching your own computer file space
    • Spotlight in OS X
    • Windows Desktop Search
    • Lucene
  – In these cases, an expert such as a librarian is often consulted. (Hopefully, the expert in your own files is you.)

Traditional IR Models

• 3 basic search techniques for traditional IR
  – Boolean models
  – Vector models
  – Probabilistic models

Boolean

• One of the earliest models
• Variations still in use in many libraries
• Boolean operators: AND, OR, NOT
  – Remember De Morgan’s Theorem?
• Operates by checking whether keywords are present or absent in a document
• There are no partial matches
  – A document is either relevant or irrelevant
  – Fuzzy set techniques are used to soften this black-and-white behavior
• Has problems with synonymy & polysemy
  – Synonymy: many words having the same meaning
  – Polysemy: a single word meaning many things
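The Boolean model described above can be sketched with Python sets. This is a minimal illustration, not a production engine: the four-document corpus and the names `docs`, `index`, and `has` are assumptions made up for this example. It shows presence/absence matching for AND, OR, and NOT, and checks De Morgan's Theorem on the result sets.

```python
# Hypothetical toy corpus: document id -> text (not from the lecture).
docs = {
    1: "hot soup recipe",
    2: "hot weather forecast",
    3: "cold weather forecast",
    4: "academic journal archive",
}

# Keyword index: term -> set of ids of documents that contain it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

all_docs = set(docs)

def has(term):
    """Documents in which the keyword is present (empty set if unseen)."""
    return index.get(term, set())

hot_and_weather = has("hot") & has("weather")   # AND
hot_or_cold = has("hot") | has("cold")          # OR
weather_not_hot = has("weather") - has("hot")   # AND NOT

# De Morgan: NOT (hot OR cold) == (NOT hot) AND (NOT cold)
lhs = all_docs - (has("hot") | has("cold"))
rhs = (all_docs - has("hot")) & (all_docs - has("cold"))

print(hot_and_weather, hot_or_cold, weather_not_hot, lhs == rhs)
```

Note that every document is either in a result set or not, with no notion of a partial match. A query for "hot" also returns both the soup and the weather documents, which is the polysemy problem in miniature.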

Boolean (continued)

• Synonymy examples
  – Something that is described as ‘academic’ might also be described as theoretical, scholarly, or pedantic
• Polysemy examples
  – Hot
    • Could mean high temperature
    • Could mean spicy
    • Could be an adjective for a person’s attractiveness
• On the upside
  – Relatively easy to create & program a Boolean engine
  – Fast; easy to process in parallel (e.g., scanning through multiple document keyword files at the same time)
  – Scales readily to large document collections (corpora)


Vector Space Model

• Have already seen some of its features
• Developed in the early 1960s to address some of the shortcomings of the Boolean model
• Advanced Vector Space Models such as LSI (Latent Semantic Indexing) can identify hidden semantic meaning
  – For example, an LSI search engine will also return documents containing “automobile” when the query term “car” is used
• 2 particular advantages of the Vector Space Model
  – Relevance Scoring
  – Relevance Feedback

Vector Space Model (cont)

• Relevance Scoring
  – VSM allows documents to partially match a query
  – This allows each document to be assigned a degree, or score, of relevancy, and the results can then be sorted by that score
• Relevance Feedback
  – VSM permits ‘tuning’ of the query
    • User can select a subset of the retrieved documents and resubmit them
    • Query is then resubmitted with this additional information
    • A revised, generally more useful, set of documents is retrieved
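Both advantages can be illustrated with a small sketch. It assumes raw term counts as vector weights and cosine similarity as the relevance score. The feedback step uses a Rocchio-style update, which is one common choice and not necessarily the method the lecture has in mind. The corpus, the function names, and the `alpha` and `beta` values are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy collection (not from the lecture).
docs = {
    "d1": "car engine repair",
    "d2": "car race results",
    "d3": "library journal archive",
}

def vectorize(text):
    """Term-frequency vector stored as a Counter (term -> count)."""
    return Counter(text.split())

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query_vec):
    """Score every document against the query and sort by score, best first."""
    scores = {d: cosine(query_vec, vectorize(t)) for d, t in docs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

def feedback(query_vec, relevant_ids, alpha=1.0, beta=0.75):
    """Rocchio-style update: move the query toward the docs marked relevant."""
    new_q = Counter({t: alpha * w for t, w in query_vec.items()})
    for d in relevant_ids:
        for t, w in vectorize(docs[d]).items():
            new_q[t] += beta * w / len(relevant_ids)
    return new_q

query = vectorize("car repair")
first = rank(query)                 # d1 matches both terms, d2 one, d3 none
revised = feedback(query, ["d1"])   # user marks d1 as relevant
second = rank(revised)
print(first)
print(second)
```

Unlike the Boolean model, `d2` gets a nonzero score for matching only one query term, so partial matches are ranked rather than discarded. After feedback, the revised query picks up the term "engine" from `d1`, which raises that document's score.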

Vector Space Model

• On the downside …
  – Drawback to the Vector Space Model is computational expense
    • Distance measures, aka similarity measures, between query & document must be computed for each document
    • Big matrix computations
    • Remember the length of a vector
    • Vector length likely grows with collection growth because of more terms (& also more documents to search)

Probabilistic Models

• Attempt to estimate the probability of a document’s relevancy to a particular user
• Retrieved documents are ranked by odds of relevance
  – Ratio of the probability that the document is relevant to the probability that the document is not relevant
• After an initial ‘guess’ by the algorithm, the model operates recursively, seeking to improve the accuracy of the probabilities
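The odds of relevance can be written out as a formula. The symbols are assumptions introduced here for illustration and are not defined in the lecture: R stands for "relevant", d for a document, and q for the user's query.

```latex
O(R \mid d, q) = \frac{P(R \mid d, q)}{P(\bar{R} \mid d, q)}
               = \frac{P(R \mid d, q)}{1 - P(R \mid d, q)}
```

For example, a document with an estimated relevance probability of 0.8 has odds 0.8 / 0.2 = 4, while one with probability 0.5 has odds 1. Since the odds grow with the probability, ranking by either gives the same order.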

Google’s Page Rank & Beyond; Langville, Meyer

• Upside
  – Can be tuned to researcher/user’s preferences
    • Researcher can set or drive probabilities as they desire
  – Potentially offers strong tailorability
• Downside
  – Difficult to build & program
  – Does not scale well; complexity grows quickly


Web IR

• Web is the world’s largest linked document collection (corpus)
• Per Langville & Meyer, 4 particular characteristics of the Web are:
  – Enormous
  – Dynamic
  – Self-organized
  – Hyperlinked

Web IR

• Enormous
  – Speaks for itself
• Dynamic
  – Virtually anyone can do almost anything on the web at any time
• Self-organized
  – No top-down governance or rules (or at least not much) on:
    • Content
    • Structure
    • Format
• Hyperlinked
  – Documents point to & reference each other in a robust, knowable way

Web IR

• Web Search process components
  – Crawler/spider
    • Software to collect the documents
  – Page Repository
    • Complete web pages are temporarily stored in total
    • Stored until indexing component parses needed data
    • Frequently accessed pages might be stored indefinitely
  – Indexing component
    • Strips out & stores needed data
      – In effect creating a compressed page
    • Original page is tossed
      – unless frequently accessed

Web IR

• Web Search process components (cont)
  – Indexes themselves
    • Content indexes using Inverted File Structure
      – e.g., this word found in these documents
  – Query module
    • Converts user’s natural language into a query
      – A Query object in Lucene’s case
    • Runs this query against the indices from the document collection
    • Returns relevant documents
      – A Hit object in Lucene’s case
    • This set of relevant pages is passed to the Ranking module
  – Ranking module
    • Combines content score for relevance with a popularity score
    • Popularity score steps us into Link Analysis & Googleness
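The index, query, and ranking components can be sketched end to end in a few lines. Everything here is an assumption for illustration: the three pages, the `popularity` values (standing in for a link-analysis score), the function names, and the equal weighting of content and popularity. This is not Lucene's API and not how any real engine combines its scores.

```python
from collections import defaultdict

# Hypothetical crawled pages: page id -> text (not from the lecture).
pages = {
    "p1": "google search engine ranking",
    "p2": "library search catalog",
    "p3": "search engine crawler and search index",
}
# Hypothetical popularity scores, standing in for a link-analysis result.
popularity = {"p1": 0.5, "p2": 0.2, "p3": 0.3}

# Indexing component: inverted file, term -> {page id: term count}.
inverted = defaultdict(dict)
for page_id, text in pages.items():
    for term in text.split():
        inverted[term][page_id] = inverted[term].get(page_id, 0) + 1

def query_module(user_text):
    """Turn the user's text into terms and look each one up in the index."""
    content = defaultdict(int)
    for term in user_text.lower().split():
        for page_id, count in inverted.get(term, {}).items():
            content[page_id] += count   # content score: matched term counts
    return dict(content)

def ranking_module(content, weight=0.5):
    """Blend content and popularity scores; 'weight' is an arbitrary choice."""
    total = sum(content.values()) or 1
    combined = {
        p: weight * (c / total) + (1 - weight) * popularity[p]
        for p, c in content.items()
    }
    return sorted(combined, key=combined.get, reverse=True)

hits = query_module("search engine")
ranked = ranking_module(hits)
print(inverted["engine"], hits, ranked)
```

In this toy data `p3` has the highest content score, but `p1` ranks first once popularity is blended in. That is the effect the Ranking module is meant to capture.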

Link Analysis

• In 1998, intense link analysis research was being done by two different groups
  – Jon Kleinberg @ IBM in Silicon Valley
  – Sergey Brin & Larry Page, two PhD students @ Stanford
• Kleinberg’s model is called HITS (Hypertext Induced Topic Search)
• Brin/Page model is called PageRank
• Brin & Page began developing a search business out of their dorm rooms
  – Took academic leave to pursue the commercial aspects of their company
• Kleinberg remained in academia (now @ Cornell) and did not pursue a company
• Brin & Page are still on academic leave


Page Rank

[Figure: example graph of six web pages, numbered 1 to 6. Source: Google’s Page Rank & Beyond; Langville, Meyer]

\[ r(P_i) = \sum_{P_j \in B_{P_i}} \frac{r(P_j)}{|P_j|} \]

The PageRank of a particular page is the sum of the PageRanks of all pages pointing to that page, each divided by the number of outlinks from the pointing page.

r(P_i) is the PageRank of page P_i

B_{P_i} is the set of pages pointing into page P_i

|P_j| is the number of outlinks from page P_j
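Because each page's rank is defined in terms of other pages' ranks, one way to evaluate the formula is to start from a guess and apply the summation repeatedly. The sketch below does this on a hypothetical three-page graph, which is not the six-page graph in the figure. It assumes every page has at least one outlink and that the iteration settles, which holds for this graph but not for every graph.

```python
# Hypothetical three-page web (not the graph in the lecture's figure):
# page -> list of pages it links to.
outlinks = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

# B_Pi: for each page, the list of pages pointing into it.
inlinks = {p: [] for p in outlinks}
for source, targets in outlinks.items():
    for target in targets:
        inlinks[target].append(source)

# Start with equal rank everywhere, then apply the summation repeatedly.
rank = {p: 1.0 / len(outlinks) for p in outlinks}
for _ in range(100):
    rank = {
        p: sum(rank[q] / len(outlinks[q]) for q in inlinks[p])
        for p in outlinks
    }

print({p: round(r, 3) for p, r in rank.items()})
```

Page B ends with half the rank of A and C because its only inlink comes from A, which splits its rank across two outlinks. Page C receives B's full rank plus half of A's.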