CSC 9010: Search Engines -- Google
Dr. Paula Matuszek, (610) 270-6851
©2003 Paula Matuszek
Search Engine Basics
– A spider or crawler starts at a web page, identifies all links on it, and follows them to new web pages.
– A parser processes each web page and extracts individual words.
– An indexer creates/updates a hash table which connects words with documents.
– A searcher uses the hash table to retrieve documents based on words.
– A ranking system decides the order in which to present the documents: their relevance.
Selecting Relevant Documents
Assume:
– we already have a corpus of documents defined
– the goal is to return a subset of those documents
– individual documents have been separated into individual files
Remaining components must parse, index, find, and rank documents.
The traditional approach is based on the words in the documents (it predates the web).
Extracting Lexical Features
Process a string of characters:
– assemble characters into tokens (tokenizer)
– choose tokens to index
In place (a problem for the WWW).
This is a standard lexical analysis problem; use a lexical analyser generator, such as lex.
Lexical Analyser
Basic idea is a finite state machine: triples of (input state, transition token, output state).
Must be very efficient; gets used a LOT.
[Diagram: a three-state machine -- state 0 loops on blanks, moves to state 1 on A-Z; state 1 loops on A-Z; blank or EOF moves to accepting state 2.]
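The three-state machine above can be sketched directly as code. This is an illustrative sketch, not generated lex output; the state numbering follows the diagram.

```python
def tokenize(text):
    """Finite-state tokenizer following the diagram above.

    State 0 skips blanks and enters state 1 on a letter; state 1
    accumulates A-Z characters into a token; a blank or end-of-input
    (the diagram's accepting state 2) emits the token.
    """
    tokens, current, state = [], [], 0
    for ch in text.upper() + " ":        # trailing blank acts as EOF
        if state == 0:
            if ch.isalpha():             # letter: start a token, go to state 1
                current.append(ch)
                state = 1
        elif state == 1:
            if ch.isalpha():             # letter: keep accumulating
                current.append(ch)
            else:                        # blank/EOF: emit token, back to state 0
                tokens.append("".join(current))
                current, state = [], 0
    return tokens

# tokenize("Search engines rank pages") -> ['SEARCH', 'ENGINES', 'RANK', 'PAGES']
```

Real generated analysers compile such transition triples into a table lookup per character, which is why they can be so efficient.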
Design Issues for Lexical Analyser
Punctuation:
– treat as whitespace?
– treat as characters?
– treat specially?
Case:
– fold?
Digits:
– assemble into numbers?
– treat as characters?
– treat as punctuation?
Lexical Analyser
Output of the lexical analyser is a string of tokens.
Remaining operations are all on these tokens.
We have already thrown away some information; this makes processing more efficient, but limits the power of our search.
Stemming
Additional processing at the token level.
Turn words into a canonical form:
– "cars" into "car"
– "children" into "child"
– "walked" into "walk"
Decreases the total number of different tokens to be processed.
Decreases the precision of a search, but increases its recall.
Stemming -- How?
Plurals to singulars (e.g., children to child).
Verbs to infinitives (e.g., talked to talk).
Clearly non-trivial in English!
Typical stemmers use a context-sensitive transformation grammar:
– (.*)SSES -> \1SS
50-250 rules are typical.
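Such suffix-rewrite rules map naturally onto regular expressions with backreferences. The rules below are a tiny illustrative subset, not the full 50-250 rule grammar a real stemmer (such as Porter's) uses:

```python
import re

# A few suffix-rewrite rules in the style described above; (.*)SSES -> \1SS
# is the slide's example, the rest are simplified illustrations.
RULES = [
    (re.compile(r"(.*)sses$"), r"\1ss"),    # caresses -> caress
    (re.compile(r"(.*)ies$"),  r"\1i"),     # ponies   -> poni
    (re.compile(r"(.{2,})s$"), r"\1"),      # cars     -> car
    (re.compile(r"(.{2,})ed$"), r"\1"),     # walked   -> walk
]

def stem(word: str) -> str:
    """Apply the first matching rule; real stemmers apply rules in phases."""
    for pattern, repl in RULES:
        if pattern.match(word):
            return pattern.sub(repl, word)
    return word

# stem("walked") -> "walk"; stem("child") is untouched (irregular plurals
# like children -> child need explicit exception lists).
```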
Noise Words (Stop Words)
Function words that contribute little or nothing to meaning.
Very frequent words:
– If a word occurs in every document, it is not useful in choosing among documents.
– However, need to be careful, because this is corpus-dependent.
Often implemented as a discrete list.
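A discrete-list implementation is a simple set-membership filter. The list below is a tiny illustrative sample; real lists are corpus- and language-dependent, as noted above:

```python
# Minimal stop-word filter; STOP_WORDS here is an illustrative sample,
# not a standard list.
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "is", "in"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

# remove_stop_words("the rank of the page".split()) -> ['rank', 'page']
```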
Example Corpora
We are assuming a fixed corpus. Some sample corpora:
– Medline Abstracts
– Email. Anyone's email.
– Reuters corpus
– Brown corpus
Textual fields vs. structured attributes:
– Textual: free, unformatted, no meta-information
– Structured: additional information beyond the content
Structured Attributes for Medline
Pubmed ID, Author, Year, Keywords, Journal
Textual Fields for Medline
Abstract:
– Reasonably complete standard academic English
– Captures the basic meaning of the document
Title:
– Short, formalized
– Captures most critical part of meaning
– Proxy for abstract
Structured Fields for Email
To, From, Cc, Bcc; Dates; Content type; Status; Content length; Subject (partially)
Text Fields for Email
Subject:
– Format is structured, content is arbitrary.
– Captures most critical part of content.
– Proxy for content -- but may be inaccurate.
Body of email:
– Highly irregular, informal English.
– Entire document, not summary.
– Spelling and grammar irregularities.
– Structure and length vary.
Indexing
We have a tokenized, stemmed sequence of words.
Next step is to parse the document, extracting index terms:
– Assume that each token is a word and we don't want to recognize any more complex structures than single words.
When all documents are processed, create the index.
Basic Indexing Algorithm
For each document in the corpus:
– get the next token
– save the posting in a list: doc ID, frequency
For each token found in the corpus:
– calculate #docs, total frequency
– sort by frequency
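The algorithm above can be sketched as an inverted-index builder. This is a minimal in-memory sketch, assuming the corpus is a dict of already-tokenized documents:

```python
from collections import Counter, defaultdict

def build_index(corpus):
    """Build a simple inverted index from {doc_id: token_list}.

    Postings are (doc_id, frequency) pairs; per-term statistics
    (#docs, total frequency) are computed once all documents are done.
    """
    index = defaultdict(list)            # term -> [(doc_id, freq), ...]
    for doc_id, tokens in corpus.items():
        for term, freq in Counter(tokens).items():
            index[term].append((doc_id, freq))

    stats = {term: (len(postings), sum(f for _, f in postings))
             for term, postings in index.items()}
    return index, stats

corpus = {
    1: ["search", "engine", "index", "search"],
    2: ["engine", "crawler"],
}
index, stats = build_index(corpus)
# stats["search"] -> (1, 2): occurs in 1 document, 2 times in total
```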
Fine Points
– Dynamic corpora require incremental algorithms.
– Higher-resolution data (e.g., character position).
– Giving extra weight to proxy text (typically by doubling or tripling frequency count).
– Document-type-specific processing:
» In HTML, want to ignore tags
» In email, maybe want to ignore quoted material
Choosing Keywords
Don't necessarily want to index on every word:
– Takes more space for index
– Takes more processing time
– May not improve our resolving power
How do we choose keywords?
– Manually
– Statistically
Exhaustivity vs. specificity.
Manually Choosing Keywords
Unconstrained vocabulary: allow the creator of the document to choose whatever he/she wants:
– "best" match
– captures new terms easily
– easiest for the person choosing keywords
Constrained vocabulary: hand-crafted ontologies:
– can include hierarchical and other relations
– more consistent
– easier for searching; possible "magic bullet" search
Examples of Constrained Vocabularies
ACM headings (www.acm.org/class/1998)
– H: Information Retrieval
– H3: Information Storage and Retrieval
– H3.3: Information Search and Retrieval
» Clustering
» Query formulation
» Relevance feedback
» Search process, etc.
Medline Headings (www.nlm.nih.gov/mesh/meshhome.html)
– L: Information Science
– L01: Information Science
– L01.700: Medical Informatics
– L01.700.508: Medical Informatics Applications
– L01.700.508.280: Information Storage and Retrieval
» Grateful Med [L01.700.508.280.400]
Automated Vocabulary Selection
Frequency: Zipf's Law.
– Pn = 1/n^a, where Pn is the frequency of occurrence of the nth ranked item and a is close to 1.
– Within one corpus, words with middle frequencies are typically "best".
Document-oriented representation bias: lots of keywords/document.
Query-oriented representation bias: only the "most typical" words. Assumes that we are comparing across documents.
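The "middle frequencies are best" heuristic can be sketched as a simple frequency-band filter. The cutoffs `low` and `high` here are illustrative assumptions, not standard values; real systems tune them per corpus, guided by the Zipf distribution:

```python
from collections import Counter

def middle_frequency_terms(tokens, low=2, high=5):
    """Keep terms whose corpus frequency falls in a middle band.

    Terms rarer than `low` are too specific to be useful; terms more
    frequent than `high` behave like stop words. Cutoffs are
    illustrative only.
    """
    counts = Counter(tokens)
    return sorted(t for t, c in counts.items() if low <= c <= high)

tokens = ["the"] * 10 + ["search"] * 3 + ["engine"] * 4 + ["zyzzyva"]
middle_frequency_terms(tokens)
# -> ['engine', 'search']: "the" is too common, "zyzzyva" too rare
```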
Choosing Keywords
"Best" depends on actual use; if a word only occurs in one document, it may be very good for retrieving that document; not, however, very effective overall.
Words which have no resolving power within a corpus may be best choices across corpora.
Not very important for web searching; will be more relevant for text mining.
Keyword Choice for WWW
We don't have a fixed corpus of documents.
New terms appear fairly regularly, and are likely to be common search terms.
Queries that people want to make are wide-ranging and unpredictable.
Therefore: can't limit keywords, except possibly to eliminate stop words.
Even stop words are language-dependent, so determine the language first.
Comparing and Ranking Documents
Once our search engine has retrieved a set of documents, we may want to:
Rank them by relevance
– Which are the best fit to my query?
– This involves determining what the query is about and how well the document answers it.
Compare them
– Show me more like this.
– This involves determining what the document is about.
Determining Relevance by Keyword
The typical web query consists entirely of keywords.
Retrieval can be binary: present or absent.
More sophisticated is to look for degree of relatedness: how much does this document reflect what the query is about?
Simple strategies:
– How many times does the word occur in the document?
– How close to the head of the document?
– If multiple keywords, how close together?
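The three simple strategies can be combined into one score. The weights below are illustrative assumptions, not values from any particular engine:

```python
def keyword_score(tokens, query_terms):
    """Score a tokenized document against a set of query keywords.

    Combines the three heuristics above: occurrence count, closeness
    to the head of the document, and proximity of multiple keywords.
    """
    positions = {}                                   # term -> token offsets
    for i, tok in enumerate(tokens):
        if tok in query_terms:
            positions.setdefault(tok, []).append(i)
    if not positions:
        return 0.0

    count = sum(len(p) for p in positions.values())              # repetition
    earliness = 1.0 / (1 + min(p[0] for p in positions.values()))  # near head
    score = count + 10.0 * earliness

    # Proximity: reward multiple query terms appearing close together.
    if len(positions) > 1:
        firsts = sorted(p[0] for p in positions.values())
        score += 10.0 / (1 + firsts[-1] - firsts[0])
    return score

doc = "search engines rank search results".split()
keyword_score(doc, {"search", "rank"})   # beats a query on "rank" alone
```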
Keywords for Relevance Ranking
Count: repetition is an indication of emphasis
– Very fast (usually in the index)
– A reasonable heuristic
– Unduly influenced by document length
– Can be "stuffed" by web designers
Position: lead paragraphs summarize content
– Requires more computation
– Also a reasonable heuristic
– Less influenced by document length
– Harder to "stuff"; can only have a few keywords near the beginning
Keywords for Relevance Ranking
Proximity for multiple keywords
– Requires even more computation
– Obviously relevant only if we have multiple keywords
– Effectiveness of the heuristic varies with information need; typically either excellent or not very helpful at all
– Very hard to "stuff"
All keyword methods
– Are computationally simple and adequately fast
– Are effective heuristics
– Typically perform as well as in-depth natural language methods for standard search
Comparing Documents
"Find me more like this one" really means that we are using the document as a query.
This requires that we have some conception of what a document is about overall.
Depends on the context of the query. We need to:
– Characterize the entire content of this document
– Discriminate between this document and others in the corpus
Characterizing a Document: Term Frequency
A document can be treated as a sequence of words.
Each word characterizes that document to some extent.
When we have eliminated stop words, the most frequent words tend to be what the document is about.
Therefore: f_kd (# of occurrences of word k in document d) will be an important measure.
Also called the term frequency.
Characterizing a Document: Document Frequency
What makes this document distinct from others in the corpus?
The terms which discriminate best are not those which occur with high frequency!
Therefore: D_k (# of documents in which word k occurs) will also be an important measure.
Also called the document frequency.
TF*IDF
This can all be summarized as: words are best discriminators when they
– occur often in this document (term frequency)
– don't occur in a lot of documents (document frequency)
One very common measure of the importance of a word to a document is TF*IDF: term frequency * inverse document frequency.
There are multiple formulas for actually computing this; the underlying concept is the same in all of them.
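One common formulation -- tf * log(N/df), chosen here for illustration since the slides note that several variants exist -- can be sketched as:

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Compute TF*IDF weights for every (term, document) pair.

    corpus: {doc_id: token_list}. Uses tf * log(N / df), where N is the
    corpus size and df is the document frequency D_k defined above.
    """
    n_docs = len(corpus)
    tf = {doc: Counter(toks) for doc, toks in corpus.items()}
    df = Counter()                        # D_k: documents containing term k
    for counts in tf.values():
        df.update(counts.keys())

    return {doc: {term: freq * math.log(n_docs / df[term])
                  for term, freq in counts.items()}
            for doc, counts in tf.items()}

corpus = {
    "d1": ["apple", "apple", "banana"],
    "d2": ["banana", "cherry"],
}
weights = tf_idf(corpus)
# "banana" appears in every document, so its IDF (and weight) is 0.
```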
Describing an Entire Document
So what is a document about?
TF*IDF: can simply list keywords in order of their TF*IDF values.
The document is about all of them to some degree: it is at some point in some vector space of meaning.
Vector Space
Any corpus has a defined set of terms (the index).
These terms define a knowledge space.
Every document is somewhere in that knowledge space -- it is or is not about each of those terms.
Consider each term as a vector. Then:
– We have an n-dimensional vector space
– Where n is the number of terms (very large!)
– Each document is a point in that vector space
The document's position in this vector space can be treated as what the document is about.
Similarity Between Documents
How similar are two documents? Measures of association:
– How much do the feature sets overlap?
– Modified for length: DICE coefficient
» DICE(x,y) = 2 f(x,y) / ( f(x) + f(y) )
» (size of the intersection relative to the number of terms compared)
– Simple matching coefficient: takes exclusions into account
– Cosine similarity
» similarity of the angle of the two document vectors
» not sensitive to vector length
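The DICE and cosine measures above can be sketched over term sets and term-frequency vectors respectively; this is a minimal sketch, not a production similarity module:

```python
import math
from collections import Counter

def dice(x_terms, y_terms):
    """DICE coefficient over two term sets: 2|X ∩ Y| / (|X| + |Y|)."""
    x, y = set(x_terms), set(y_terms)
    return 2 * len(x & y) / (len(x) + len(y))

def cosine(x_tokens, y_tokens):
    """Cosine similarity of term-frequency vectors (length-insensitive)."""
    x, y = Counter(x_tokens), Counter(y_tokens)
    dot = sum(x[t] * y[t] for t in x.keys() & y.keys())
    norm = (math.sqrt(sum(v * v for v in x.values()))
            * math.sqrt(sum(v * v for v in y.values())))
    return dot / norm if norm else 0.0

a = "dog bites man".split()
b = "man bites dog".split()
# Bag-of-words measures ignore word order, so a and b come out identical
# -- exactly the "man bites dog" limitation discussed below.
```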
Bag of Words
All of these techniques are what is known as bag-of-words approaches:
– Keywords are treated in isolation
– The difference between "man bites dog" and "dog bites man" is non-existent
Later we will discuss linguistic approaches which pay attention to semantics.
GOOGLE
Web-based search engine:
– Doesn't have a predefined corpus
– Has a very wide variety of documents
– Queries are typically short and simple
– Is not concerned with document similarity
– Is VERY concerned with relevance
Google Goals
Relevance
– Techniques derived from Information Retrieval are focused on "Is this about my query?"
– BUT: web content is largely unfiltered and unrefereed
– So relevance should also focus on "Is this a good web page?"
Scalability
Improved search quality
Academic research tool
Determining Relevance
– Page Rank
– Anchor text (also allows indexing of documents not spidered, such as images)
– Proximity
– Font characteristics
Page Rank
Unique characteristic of GOOGLE.
Citation graph of the web.
Focus is on the importance of the page, not on about-ness:
– NO: Is this what I asked about?
– YES: How good is this page?
Still: How well does this meet my information need?
Simple PageRank
Model is of a surfer randomly searching the web; page rank is the probability that the searcher will reach a specific page.
Given by:
– r(i) = sum over j in B(i) of r(j) / N(j)
where:
– B(i): set of pages that link to i
– N(j): number of outgoing links from j
Page Rank, cont.
Can get "bogged down" in highly inter-connected pages, so add a "damping factor".
"Bored surfer" model -- the searcher gets bored and moves to a random new page :-).
Google Architecture
[Diagram of the original Google architecture; its modules and data stores are listed on the next two slides.]

Architecture: Modules
Crawlers (distributed), URL Server, Store Server, Indexer, Sorter, URL Resolver, DumpLexicon, Searcher
Architecture: Data Stores
– Repository: compressed documents, with URL, docID, length
– Barrel: for a range of wordIDs, holds list of wordIDs with hit list, docIDs
– Anchor file: link information, with source, target and anchor text
– Link database: pairs of docIDs
– Document index: indexed by docID; contains status, pointer to repository
– Lexicon: list of words, hash table of pointers to barrels
GOOGLE API
– Search requests: submit a query string and a set of parameters to the Google Web APIs service and receive in return a set of search results.
– Cache requests: submit a URL to the Google Web APIs service and receive in return the contents of the URL when Google's crawlers last visited the page.
– Spelling requests: submit a query to the Google Web APIs service and receive in return a suggested spelling correction for the query.
Web Page Freshness
Web page "freshness": Junghoo Cho, Stanford, 2001.
– rose.cs.ucla.edu/~cho/talks/2001/UCLA.ppt
How often do web pages change?
Experimental Setup
February 17 to June 24, 1999.
270 sites visited (with permission):
– identified 400 sites with highest "PageRank"
– contacted administrators
720,000 pages collected:
– 3,000 pages from each site daily
– start at root, visit breadth first (get new & old pages)
– ran only 9pm - 6am, 10 seconds between site requests
Average Change Interval
[Bar chart: fraction of pages (roughly 0.00-0.35) by average change interval, in buckets of up to 1 day, 1 day-1 week, 1 week-1 month, 1 month-4 months, and over 4 months.]
Change Interval -- By Domain
[Bar chart: fraction of pages (up to about 0.6) by average change interval, broken down by domain (com, net, org, edu, gov), using the same interval buckets as the previous slide.]
Refresh Strategy
Crawlers can refresh only a certain number of pages in a period of time.
The page download resource can be allocated in many ways.
The proportional refresh policy allocates the resource proportionally to the pages' change rate.
Proportional Often Not Good!
– Visit fast-changing e1: get 1/2 day of freshness.
– Visit slow-changing e2: get 1/2 week of freshness.
– Visiting e2 is a better deal!
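The arithmetic behind the e1/e2 example can be checked with a small simulation. Assuming a page changes at a fixed interval and we download it at a uniformly random moment within that interval, the copy stays fresh until the next change -- on average half the interval (the daily/weekly intervals for e1 and e2 are the slide's illustrative figures):

```python
import random

def expected_freshness_days(change_interval_days, trials=100_000, seed=0):
    """Monte Carlo estimate of how long one download stays fresh.

    The download happens at a uniformly random point in the page's
    change interval; the copy is fresh for the remaining time until
    the next change, which averages half the interval.
    """
    rng = random.Random(seed)
    total = sum(change_interval_days - rng.uniform(0, change_interval_days)
                for _ in range(trials))
    return total / trials

e1 = expected_freshness_days(1)   # changes daily  -> about 0.5 days fresh
e2 = expected_freshness_days(7)   # changes weekly -> about 3.5 days fresh
# One visit to slow-changing e2 buys roughly seven times the freshness,
# which is why the proportional policy (visit e1 more) can lose.
```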
Comparing Policies
Policy         Freshness   Age
Proportional   0.12        400 days
Uniform        0.57        5.6 days
Optimal        0.62        4.3 days
Based on statistics from the experiment and a revisit frequency of every month.
Not Every Page is Equal!
[Example: page e1 is accessed by users 20 times/day, page e2 only 10 times/day.]
Some pages are "more important".
So weight pages by importance.
Allocating Freshness Resources
– Visit more important pages more frequently.
– Visit pages according to a strategy which optimizes a "freshness metric".
Additional Search Issues
In addition to improved relevance and freshness, can improve overall search with some other factors:
– Eliminate duplicate documents
– Eliminate multiple documents from one site
– Provide good context
– Clearly identify paid links