Lecture 12
IR in Google Age
Traditional IR
• Traditional IR examples
  – Searching a university library
  – Finding an article in a journal archive
  – Searching your own computer file space
    • Spotlight in OS X
    • Windows Desktop Search
    • Lucene
  – In these cases, an expert such as a librarian is often used. (Hopefully, the expert in your own files is you.)
Traditional IR Models
• 3 basic search techniques for traditional IR
  – Boolean models
  – Vector models
  – Probabilistic models
Boolean
• One of the earliest models
• Variations still in use in many libraries
• Boolean operators: AND, OR, NOT
  – Remember De Morgan’s Theorem?
• Operates by checking whether keywords are present or absent in a document
• There are no partial matches
  – A document is either relevant or irrelevant
  – Fuzzy set techniques are used to soften this black-and-white behavior
• Has problems with synonymy & polysemy
  – Synonymy: many words having the same meaning
  – Polysemy: a single word meaning many things
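The Boolean model above can be sketched in a few lines. This is a minimal illustration, not a production engine: the four documents, the `docs_with` helper, and the whitespace tokenization are all assumptions made for the example. It shows AND, OR, and NOT as set operations, checks De Morgan's Theorem on one query, and shows the synonymy problem (a search for "academic" misses the "scholarly" document).

```python
# Toy document collection (hypothetical, for illustration only).
docs = {
    1: "hot spicy curry recipe",
    2: "hot weather forecast",
    3: "academic journal archive",
    4: "scholarly journal on weather",
}

# Map each keyword to the set of documents that contain it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

all_docs = set(docs)


def docs_with(term):
    """Return the set of document ids containing the keyword."""
    return index.get(term, set())


hot = docs_with("hot")
weather = docs_with("weather")

hot_and_weather = hot & weather                  # AND: both keywords present
hot_or_weather = hot | weather                   # OR: either keyword present
hot_not_spicy = hot - docs_with("spicy")         # AND NOT: hot but not spicy

# De Morgan: NOT (A OR B) equals (NOT A) AND (NOT B).
not_either = all_docs - (hot | weather)
neither = (all_docs - hot) & (all_docs - weather)

# Synonymy problem: the query term misses document 4 ("scholarly").
academic_hits = docs_with("academic")
```

Every document either is in the result set or is not, which is the "no partial matches" property noted above.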
Boolean (continued)
• Synonymy examples
  – Something described as ‘academic’ might also be described as theoretical, scholarly, or pedantic
• Polysemy examples
  – Hot
    • Could mean high temperature
    • Could mean spicy
    • Could be an adjective for a person’s attractiveness
• On the upside
  – Relatively easy to create & program a Boolean engine
  – Fast; easy to process in parallel (e.g., scanning through multiple document keyword files at the same time)
  – Scales readily to large document collections (corpora)
Vector Space Model
• Have already seen some of its features
• Developed in the early 1960s to address some of the shortcomings of the Boolean model
• Advanced Vector Space Models such as LSI (Latent Semantic Indexing) can identify hidden semantic meaning
  – For example, an LSI search engine will also return documents containing “automobile” when the query term “car” is used
• 2 particular advantages of the Vector Space Model
  – Relevance Scoring
  – Relevance Feedback
Vector Space Model (cont)
• Relevance Scoring
  – VSM allows documents to partially match a query
  – Each document can therefore be assigned a degree, or score, of relevance, and results can be sorted by that score
• Relevance Feedback
  – VSM permits ‘tuning’ of the query
    • User can select a subset of the retrieved documents and submit them as feedback
    • Query is then resubmitted with this additional information
    • A revised, generally more useful set of documents is retrieved
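Both advantages can be illustrated with a small sketch. The slides do not name a specific similarity measure or feedback method, so the choices here are assumptions: raw term counts as vector weights, cosine similarity for scoring, and a simplified Rocchio-style update (positive feedback only) for tuning the query. The documents and the `beta` weight are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn text into a sparse term-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def feedback(query, relevant_docs, beta=0.5):
    """Move the query toward documents the user marked as relevant."""
    revised = Counter(query)
    for doc in relevant_docs:
        for term, weight in doc.items():
            revised[term] += beta * weight / len(relevant_docs)
    return revised


docs = {
    "d1": "car repair and car maintenance",
    "d2": "used car prices",
    "d3": "cooking spicy food",
}

# Relevance scoring: partial matches get a score, and scores can be sorted.
query = vectorize("car repair")
scores = {name: cosine(query, vectorize(text)) for name, text in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

# Relevance feedback: the user marks d1 as relevant and the query is revised.
revised = feedback(query, [vectorize(docs["d1"])])
new_doc = vectorize("car maintenance tips")
before = cosine(query, new_doc)
after = cosine(revised, new_doc)
```

After feedback, the revised query also carries weight on "maintenance", so a document about car maintenance scores higher than it did against the original query.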
Vector Space Model
• On the downside …
  – Drawback to Vector Space Model is computational expense
    • Distance measures, aka similarity measures, between query & document must be computed for each document
    • Big matrix computations
    • Remember the length of a vector
    • Vector length (its number of term dimensions) likely grows with collection growth because of more terms (& also more documents to search)
Probabilistic Models
• Attempt to estimate the probability of a document’s relevance to a particular user
• Retrieved documents are ranked by odds of relevance
  – Ratio of the probability that the document is relevant to the probability that the document is not relevant
• After an initial ‘guess’ by the algorithm, the model operates recursively, seeking to improve the accuracy of the probabilities
Google’s Page Rank & Beyond; Langville, Meyer
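The odds-of-relevance ranking can be made concrete with a tiny sketch. The probability estimates below are invented for illustration; in a real probabilistic model they would come from the initial guess and be refined on each pass. The `odds` helper is a hypothetical name and assumes each estimate is strictly less than 1.

```python
def odds(p_relevant):
    """Odds of relevance: P(relevant) / P(not relevant). Assumes p_relevant < 1."""
    return p_relevant / (1.0 - p_relevant)


# Hypothetical relevance estimates for three documents.
estimates = {"doc_a": 0.8, "doc_b": 0.5, "doc_c": 0.2}

# Rank documents from highest to lowest odds of relevance.
ranked = sorted(estimates, key=lambda d: odds(estimates[d]), reverse=True)
```

A document with an estimated relevance probability of 0.8 has odds of about 4 to 1, while one at 0.2 has odds of 1 to 4, so it ranks last.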
• Upside
  – Can be tuned to researcher/user’s preferences
    • Researcher can set or drive probabilities as they desire
  – Potentially offers strong tailorability
• Downside
  – Difficult to build & program
  – Does not scale well; complexity grows quickly
Web IR
• The Web is the world’s largest linked document collection (corpus)
• Per Langville & Meyer, 4 particular characteristics of the Web are:
  – Enormous
  – Dynamic
  – Self-organized
  – Hyperlinked
Web IR
• Enormous
  – Speaks for itself
• Dynamic
  – Virtually anyone can do almost anything on the web at any time
• Self-organized
  – No top-down governance or rules (or at least not much) on:
    • Content
    • Structure
    • Format
• Hyperlinked
  – Documents point to & reference each other in a robust, knowable way
Web IR
• Web Search process components
  – Crawler/spider
    • Software to collect the documents
  – Page Repository
    • Complete web pages are temporarily stored in total
    • Stored until indexing component parses needed data
    • Frequently accessed pages might be stored indefinitely
  – Indexing component
    • Strips out & stores needed data
      – In effect creating a compressed page
    • Original page is tossed
      – unless frequently accessed
Web IR
• Web Search process components (cont)
  – Indexes themselves
    • Content indexes using Inverted File Structure
      – e.g., this word found in these documents
  – Query module
    • Converts user’s natural language into a query
      – A Query object in Lucene’s case
    • Runs this query against the indices from the document collection
    • Returns relevant documents
      – A Hit object in Lucene’s case
    • This set of relevant pages is passed to the Ranking module
  – Ranking module
    • Combines content score for relevance and also popularity score
    • Popularity score steps us into Link Analysis & Googleness
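The index, query module, and ranking module can be sketched end to end. This is a toy pipeline under loud assumptions: the three pages, the `popularity` values, and the equal-weight sum used to combine content and popularity scores are all hypothetical, since the slides do not say how the two scores are combined. It does not use Lucene's API.

```python
from collections import defaultdict

# Hypothetical pages, standing in for what the crawler collected.
pages = {
    "a.html": "google search engine ranking",
    "b.html": "library search catalog",
    "c.html": "search engine crawler and search index",
}

# Indexing component: inverted file mapping term -> {page: term count}.
inverted = defaultdict(dict)
for url, text in pages.items():
    for term in text.split():
        inverted[term][url] = inverted[term].get(url, 0) + 1


def query_module(user_query):
    """Look up each query term in the inverted file and sum term counts per page."""
    scores = defaultdict(int)
    for term in user_query.lower().split():
        for url, count in inverted.get(term, {}).items():
            scores[url] += count
    return dict(scores)


# Hypothetical popularity scores (in practice these come from link analysis).
popularity = {"a.html": 0.7, "b.html": 0.1, "c.html": 0.2}


def ranking_module(content_scores, weight=0.5):
    """Combine normalized content score with popularity (assumed weighted sum)."""
    if not content_scores:
        return []
    top = max(content_scores.values())
    combined = {
        url: weight * score / top + (1 - weight) * popularity[url]
        for url, score in content_scores.items()
    }
    return sorted(combined, key=combined.get, reverse=True)


content = query_module("search engine")
results = ranking_module(content)
```

On content alone, c.html would rank first because it matches the query terms most often; with the assumed popularity scores mixed in, a.html moves to the top.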
Link Analysis
• In 1998, intense link analysis research was being done by two different groups
  – Jon Kleinberg @ IBM in Silicon Valley
  – Sergey Brin & Larry Page, two PhD students @ Stanford
• Kleinberg model called HITS – Hypertext Induced Topic Search
• Brin/Page model called PageRank
• Brin & Page began developing a search business out of their dorm rooms
  – Took academic leave to pursue the commercial aspects of their company
• Kleinberg remained with academia (now @ Cornell) and did not pursue a company
• Brin & Page are still on academic leave
Page Rank
[Figure: graph of six web pages, numbered 1 to 6]
Google’s Page Rank & Beyond; Langville, Meyer
$$ r(P_i) = \sum_{P_j \in B_{P_i}} \frac{r(P_j)}{|P_j|} $$
The PageRank of a particular page is the sum of the PageRanks of all pages pointing to that page, each divided by the number of outlinks from the pointing page.
r(P_i) is the PageRank of page P_i
B_{P_i} is the set of pages pointing into page P_i
|P_j| is the number of all outlinks from P_j
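Because each page's PageRank depends on the PageRanks of the pages pointing to it, the formula is usually applied iteratively. The sketch below is a minimal illustration under several assumptions: the three-page graph is hypothetical (it is not the six-page graph from the figure), every page starts with rank 1/n, every page has at least one outlink, and no damping or other adjustment is applied.

```python
def pagerank(out_links, iterations=100):
    """Repeatedly apply r(P_i) = sum over P_j in B_{P_i} of r(P_j) / |P_j|."""
    pages = list(out_links)
    # Assumed starting point: every page gets an equal share of rank.
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: 0.0 for page in pages}
        for p_j, targets in out_links.items():
            # P_j splits its current rank evenly among its |P_j| outlinks.
            share = rank[p_j] / len(targets)
            for p_i in targets:
                new_rank[p_i] += share
        rank = new_rank
    return rank


# Hypothetical graph: page 1 links to 2 and 3, page 2 links to 3, page 3 links to 1.
graph = {1: [2, 3], 2: [3], 3: [1]}
ranks = pagerank(graph)
```

For this graph the values settle near 0.4 for pages 1 and 3 and 0.2 for page 2: page 2 receives only half of page 1's rank, while page 3 receives the other half plus all of page 2's.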