CSC 9010: Search Engines -- Google
Dr. Paula Matuszek, (610) 270-6851
©2003 Paula Matuszek
Search Engine Basics
– A spider or crawler starts at a web page, identifies all links on it, and follows them to new web pages.
– A parser processes each web page and extracts individual words.
– An indexer creates/updates a hash table which connects words with documents.
– A searcher uses the hash table to retrieve documents based on words.
– A ranking system decides the order in which to present the documents: their relevance.
Selecting Relevant Documents
Assume:
– we already have a corpus of documents defined
– the goal is to return a subset of those documents
– individual documents have been separated into individual files
Remaining components must parse, index, find, and rank documents.
The traditional approach is based on the words in the documents (it predates the web).
Extracting Lexical Features
Process a string of characters:
– assemble characters into tokens (tokenizer)
– choose tokens to index
In place (a problem for the WWW).
This is a standard lexical analysis problem; use a lexical analyser generator, such as lex.
Lexical Analyser
Basic idea is a finite state machine: triples of (input state, transition token, output state).
Must be very efficient; gets used a LOT.
[Diagram: a three-state machine -- state 0 loops on blanks, moves to state 1 on A-Z; state 1 loops on A-Z; blank or EOF moves to accepting state 2.]
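The three-state machine above can be sketched directly as code. This is an illustrative sketch, not generated lex output; the state numbering follows the diagram.

```python
def tokenize(text):
    """Finite-state tokenizer following the diagram above.

    State 0 skips blanks and enters state 1 on a letter; state 1
    accumulates A-Z characters into a token; a blank or end-of-input
    (the diagram's accepting state 2) emits the token.
    """
    tokens, current, state = [], [], 0
    for ch in text.upper() + " ":        # trailing blank acts as EOF
        if state == 0:
            if ch.isalpha():             # letter: start a token, go to state 1
                current.append(ch)
                state = 1
        elif state == 1:
            if ch.isalpha():             # letter: keep accumulating
                current.append(ch)
            else:                        # blank/EOF: emit token, back to state 0
                tokens.append("".join(current))
                current, state = [], 0
    return tokens

# tokenize("Search engines rank pages") -> ['SEARCH', 'ENGINES', 'RANK', 'PAGES']
```

Real generated analysers compile such transition triples into a table lookup per character, which is why they can be so efficient.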
Design Issues for Lexical Analyser
Punctuation:
– treat as whitespace?
– treat as characters?
– treat specially?
Case:
– fold?
Digits:
– assemble into numbers?
– treat as characters?
– treat as punctuation?
Lexical Analyser
Output of the lexical analyser is a string of tokens.
Remaining operations are all on these tokens.
We have already thrown away some information; this makes processing more efficient, but limits the power of our search.
Stemming
Additional processing at the token level.
Turn words into a canonical form:
– "cars" into "car"
– "children" into "child"
– "walked" into "walk"
Decreases the total number of different tokens to be processed.
Decreases the precision of a search, but increases its recall.
Stemming -- How?
Plurals to singulars (e.g., children to child).
Verbs to infinitives (e.g., talked to talk).
Clearly non-trivial in English!
Typical stemmers use a context-sensitive transformation grammar:
– (.*)SSES -> \1SS
50-250 rules are typical.
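Such suffix-rewrite rules map naturally onto regular expressions with backreferences. The rules below are a tiny illustrative subset, not the full 50-250 rule grammar a real stemmer (such as Porter's) uses:

```python
import re

# A few suffix-rewrite rules in the style described above; (.*)SSES -> \1SS
# is the slide's example, the rest are simplified illustrations.
RULES = [
    (re.compile(r"(.*)sses$"), r"\1ss"),    # caresses -> caress
    (re.compile(r"(.*)ies$"),  r"\1i"),     # ponies   -> poni
    (re.compile(r"(.{2,})s$"), r"\1"),      # cars     -> car
    (re.compile(r"(.{2,})ed$"), r"\1"),     # walked   -> walk
]

def stem(word: str) -> str:
    """Apply the first matching rule; real stemmers apply rules in phases."""
    for pattern, repl in RULES:
        if pattern.match(word):
            return pattern.sub(repl, word)
    return word

# stem("walked") -> "walk"; stem("child") is untouched (irregular plurals
# like children -> child need explicit exception lists).
```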
Noise Words (Stop Words)
Function words that contribute little or nothing to meaning.
Very frequent words:
– If a word occurs in every document, it is not useful in choosing among documents.
– However, need to be careful, because this is corpus-dependent.
Often implemented as a discrete list.
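A discrete-list implementation is a simple set-membership filter. The list below is a tiny illustrative sample; real lists are corpus- and language-dependent, as noted above:

```python
# Minimal stop-word filter; STOP_WORDS here is an illustrative sample,
# not a standard list.
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "is", "in"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

# remove_stop_words("the rank of the page".split()) -> ['rank', 'page']
```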
Example Corpora
We are assuming a fixed corpus. Some sample corpora:
– Medline Abstracts
– Email. Anyone's email.
– Reuters corpus
– Brown corpus
Textual fields vs. structured attributes:
– Textual: free, unformatted, no meta-information
– Structured: additional information beyond the content
Structured Attributes for Medline
Pubmed ID, Author, Year, Keywords, Journal
Textual Fields for Medline
Abstract:
– Reasonably complete standard academic English
– Captures the basic meaning of the document
Title:
– Short, formalized
– Captures most critical part of meaning
– Proxy for abstract
Structured Fields for Email
To, From, Cc, Bcc; Dates; Content type; Status; Content length; Subject (partially)
Text Fields for Email
Subject:
– Format is structured, content is arbitrary.
– Captures most critical part of content.
– Proxy for content -- but may be inaccurate.
Body of email:
– Highly irregular, informal English.
– Entire document, not summary.
– Spelling and grammar irregularities.
– Structure and length vary.
Indexing
We have a tokenized, stemmed sequence of words.
Next step is to parse the document, extracting index terms:
– Assume that each token is a word and we don't want to recognize any more complex structures than single words.
When all documents are processed, create the index.
Basic Indexing Algorithm
For each document in the corpus:
– get the next token
– save the posting in a list: doc ID, frequency
For each token found in the corpus:
– calculate #docs, total frequency
– sort by frequency
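The algorithm above can be sketched as an inverted-index builder. This is a minimal in-memory sketch, assuming the corpus is a dict of already-tokenized documents:

```python
from collections import Counter, defaultdict

def build_index(corpus):
    """Build a simple inverted index from {doc_id: token_list}.

    Postings are (doc_id, frequency) pairs; per-term statistics
    (#docs, total frequency) are computed once all documents are done.
    """
    index = defaultdict(list)            # term -> [(doc_id, freq), ...]
    for doc_id, tokens in corpus.items():
        for term, freq in Counter(tokens).items():
            index[term].append((doc_id, freq))

    stats = {term: (len(postings), sum(f for _, f in postings))
             for term, postings in index.items()}
    return index, stats

corpus = {
    1: ["search", "engine", "index", "search"],
    2: ["engine", "crawler"],
}
index, stats = build_index(corpus)
# stats["search"] -> (1, 2): occurs in 1 document, 2 times in total
```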
Fine Points
– Dynamic corpora require incremental algorithms.
– Higher-resolution data (e.g., character position).
– Giving extra weight to proxy text (typically by doubling or tripling frequency count).
– Document-type-specific processing:
» In HTML, want to ignore tags
» In email, maybe want to ignore quoted material
Choosing Keywords
Don't necessarily want to index on every word:
– Takes more space for index
– Takes more processing time
– May not improve our resolving power
How do we choose keywords?
– Manually
– Statistically
Exhaustivity vs. specificity.
Manually Choosing Keywords
Unconstrained vocabulary: allow the creator of the document to choose whatever he/she wants:
– "best" match
– captures new terms easily
– easiest for the person choosing keywords
Constrained vocabulary: hand-crafted ontologies:
– can include hierarchical and other relations
– more consistent
– easier for searching; possible "magic bullet" search
Examples of Constrained Vocabularies
ACM headings (www.acm.org/class/1998)
– H: Information Retrieval
– H3: Information Storage and Retrieval
– H3.3: Information Search and Retrieval
» Clustering
» Query formulation
» Relevance feedback
» Search process, etc.
Medline Headings (www.nlm.nih.gov/mesh/meshhome.html)
– L: Information Science
– L01: Information Science
– L01.700: Medical Informatics
– L01.700.508: Medical Informatics Applications
– L01.700.508.280: Information Storage and Retrieval
» Grateful Med [L01.700.508.280.400]
Automated Vocabulary Selection
Frequency: Zipf's Law.
– Pn = 1/n^a, where Pn is the frequency of occurrence of the nth ranked item and a is close to 1.
– Within one corpus, words with middle frequencies are typically "best".
Document-oriented representation bias: lots of keywords/document.
Query-oriented representation bias: only the "most typical" words. Assumes that we are comparing across documents.
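The "middle frequencies are best" heuristic can be sketched as a simple frequency-band filter. The cutoffs `low` and `high` here are illustrative assumptions, not standard values; real systems tune them per corpus, guided by the Zipf distribution:

```python
from collections import Counter

def middle_frequency_terms(tokens, low=2, high=5):
    """Keep terms whose corpus frequency falls in a middle band.

    Terms rarer than `low` are too specific to be useful; terms more
    frequent than `high` behave like stop words. Cutoffs are
    illustrative only.
    """
    counts = Counter(tokens)
    return sorted(t for t, c in counts.items() if low <= c <= high)

tokens = ["the"] * 10 + ["search"] * 3 + ["engine"] * 4 + ["zyzzyva"]
middle_frequency_terms(tokens)
# -> ['engine', 'search']: "the" is too common, "zyzzyva" too rare
```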
Choosing Keywords
"Best" depends on actual use; if a word only occurs in one document, it may be very good for retrieving that document; not, however, very effective overall.
Words which have no resolving power within a corpus may be best choices across corpora.
Not very important for web searching; will be more relevant for text mining.
Keyword Choice for WWW
We don't have a fixed corpus of documents.
New terms appear fairly regularly, and are likely to be common search terms.
Queries that people want to make are wide-ranging and unpredictable.
Therefore: can't limit keywords, except possibly to eliminate stop words.
Even stop words are language-dependent, so determine the language first.
Comparing and Ranking Documents
Once our search engine has retrieved a set of documents, we may want to:
Rank them by relevance
– Which are the best fit to my query?
– This involves determining what the query is about and how well the document answers it.
Compare them
– Show me more like this.
– This involves determining what the document is about.
Determining Relevance by Keyword
The typical web query consists entirely of keywords.
Retrieval can be binary: present or absent.
More sophisticated is to look for degree of relatedness: how much does this document reflect what the query is about?
Simple strategies:
– How many times does the word occur in the document?
– How close to the head of the document?
– If multiple keywords, how close together?
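The three simple strategies can be combined into one score. The weights below are illustrative assumptions, not values from any particular engine:

```python
def keyword_score(tokens, query_terms):
    """Score a tokenized document against a set of query keywords.

    Combines the three heuristics above: occurrence count, closeness
    to the head of the document, and proximity of multiple keywords.
    """
    positions = {}                                   # term -> token offsets
    for i, tok in enumerate(tokens):
        if tok in query_terms:
            positions.setdefault(tok, []).append(i)
    if not positions:
        return 0.0

    count = sum(len(p) for p in positions.values())              # repetition
    earliness = 1.0 / (1 + min(p[0] for p in positions.values()))  # near head
    score = count + 10.0 * earliness

    # Proximity: reward multiple query terms appearing close together.
    if len(positions) > 1:
        firsts = sorted(p[0] for p in positions.values())
        score += 10.0 / (1 + firsts[-1] - firsts[0])
    return score

doc = "search engines rank search results".split()
keyword_score(doc, {"search", "rank"})   # beats a query on "rank" alone
```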
Keywords for Relevance Ranking
Count: repetition is an indication of emphasis
– Very fast (usually in the index)
– A reasonable heuristic
– Unduly influenced by document length
– Can be "stuffed" by web designers
Position: lead paragraphs summarize content
– Requires more computation
– Also a reasonable heuristic
– Less influenced by document length
– Harder to "stuff"; can only have a few keywords near the beginning
Keywords for Relevance Ranking
Proximity for multiple keywords
– Requires even more computation
– Obviously relevant only if we have multiple keywords
– Effectiveness of the heuristic varies with information need; typically either excellent or not very helpful at all
– Very hard to "stuff"
All keyword methods
– Are computationally simple and adequately fast
– Are effective heuristics
– Typically perform as well as in-depth natural language methods for standard search
Comparing Documents
"Find me more like this one" really means that we are using the document as a query.
This requires that we have some conception of what a document is about overall.
Depends on the context of the query. We need to:
– Characterize the entire content of this document
– Discriminate between this document and others in the corpus
Characterizing a Document: Term Frequency
A document can be treated as a sequence of words.
Each word characterizes that document to some extent.
When we have eliminated stop words, the most frequent words tend to be what the document is about.
Therefore: f_kd (# of occurrences of word k in document d) will be an important measure.
Also called the term frequency.
Characterizing a Document: Document Frequency
What makes this document distinct from others in the corpus?
The terms which discriminate best are not those which occur with high frequency!
Therefore: D_k (# of documents in which word k occurs) will also be an important measure.
Also called the document frequency.
TF*IDF
This can all be summarized as: words are best discriminators when they
– occur often in this document (term frequency)
– don't occur in a lot of documents (document frequency)
One very common measure of the importance of a word to a document is TF*IDF: term frequency * inverse document frequency.
There are multiple formulas for actually computing this; the underlying concept is the same in all of them.
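One common formulation -- tf * log(N/df), chosen here for illustration since the slides note that several variants exist -- can be sketched as:

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Compute TF*IDF weights for every (term, document) pair.

    corpus: {doc_id: token_list}. Uses tf * log(N / df), where N is the
    corpus size and df is the document frequency D_k defined above.
    """
    n_docs = len(corpus)
    tf = {doc: Counter(toks) for doc, toks in corpus.items()}
    df = Counter()                        # D_k: documents containing term k
    for counts in tf.values():
        df.update(counts.keys())

    return {doc: {term: freq * math.log(n_docs / df[term])
                  for term, freq in counts.items()}
            for doc, counts in tf.items()}

corpus = {
    "d1": ["apple", "apple", "banana"],
    "d2": ["banana", "cherry"],
}
weights = tf_idf(corpus)
# "banana" appears in every document, so its IDF (and weight) is 0.
```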
Describing an Entire Document
So what is a document about?
TF*IDF: can simply list keywords in order of their TF*IDF values.
The document is about all of them to some degree: it is at some point in some vector space of meaning.
Vector Space
Any corpus has a defined set of terms (the index).
These terms define a knowledge space.
Every document is somewhere in that knowledge space -- it is or is not about each of those terms.
Consider each term as a vector. Then:
– We have an n-dimensional vector space
– Where n is the number of terms (very large!)
– Each document is a point in that vector space
The document's position in this vector space can be treated as what the document is about.
Similarity Between Documents
How similar are two documents? Measures of association:
– How much do the feature sets overlap?
– Modified for length: DICE coefficient
» DICE(x,y) = 2 f(x,y) / ( f(x) + f(y) )
» (size of the intersection relative to the number of terms compared)
– Simple matching coefficient: takes exclusions into account
– Cosine similarity
» similarity of the angle of the two document vectors
» not sensitive to vector length
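The DICE and cosine measures above can be sketched over term sets and term-frequency vectors respectively; this is a minimal sketch, not a production similarity module:

```python
import math
from collections import Counter

def dice(x_terms, y_terms):
    """DICE coefficient over two term sets: 2|X ∩ Y| / (|X| + |Y|)."""
    x, y = set(x_terms), set(y_terms)
    return 2 * len(x & y) / (len(x) + len(y))

def cosine(x_tokens, y_tokens):
    """Cosine similarity of term-frequency vectors (length-insensitive)."""
    x, y = Counter(x_tokens), Counter(y_tokens)
    dot = sum(x[t] * y[t] for t in x.keys() & y.keys())
    norm = (math.sqrt(sum(v * v for v in x.values()))
            * math.sqrt(sum(v * v for v in y.values())))
    return dot / norm if norm else 0.0

a = "dog bites man".split()
b = "man bites dog".split()
# Bag-of-words measures ignore word order, so a and b come out identical
# -- exactly the "man bites dog" limitation discussed below.
```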
Bag of Words
All of these techniques are what is known as bag-of-words approaches:
– Keywords are treated in isolation
– The difference between "man bites dog" and "dog bites man" is non-existent
Later we will discuss linguistic approaches which pay attention to semantics.
GOOGLE
Web-based search engine:
– Doesn't have a predefined corpus
– Has a very wide variety of documents
– Queries are typically short and simple
– Is not concerned with document similarity
– Is VERY concerned with relevance
Google Goals
Relevance
– Techniques derived from Information Retrieval are focused on "Is this about my query?"
– BUT: web content is largely unfiltered and unrefereed
– So relevance should also focus on "Is this a good web page?"
Scalability
Improved search quality
Academic research tool
Determining Relevance
– Page Rank
– Anchor text (also allows indexing of documents not spidered, such as images)
– Proximity
– Font characteristics
Page Rank
Unique characteristic of GOOGLE.
Citation graph of the web.
Focus is on the importance of the page, not on about-ness:
– NO: Is this what I asked about?
– YES: How good is this page?
Still: How well does this meet my information need?
Simple PageRank
Model is of a surfer randomly searching the web; page rank is the probability that the searcher will reach a specific page.
Given by:
– r(i) = sum over j in B(i) of r(j) / N(j)
where:
– B(i): set of pages that link to i
– N(j): number of outgoing links from j
Page Rank, cont.
Can get "bogged down" in highly inter-connected pages, so add a "damping factor".
"Bored surfer" model -- the searcher gets bored and moves to a random new page :-).
Google Architecture
[Diagram of the original Google architecture; its modules and data stores are listed on the next two slides.]

Architecture: Modules
Crawlers (distributed), URL Server, Store Server, Indexer, Sorter, URL Resolver, DumpLexicon, Searcher
Architecture: Data Stores
– Repository: compressed documents, with URL, docID, length
– Barrel: for a range of wordIDs, holds list of wordIDs with hit list, docIDs
– Anchor file: link information, with source, target and anchor text
– Link database: pairs of docIDs
– Document index: indexed by docID; contains status, pointer to repository
– Lexicon: list of words, hash table of pointers to barrels
GOOGLE API
– Search requests: submit a query string and a set of parameters to the Google Web APIs service and receive in return a set of search results.
– Cache requests: submit a URL to the Google Web APIs service and receive in return the contents of the URL when Google's crawlers last visited the page.
– Spelling requests: submit a query to the Google Web APIs service and receive in return a suggested spelling correction for the query.
Web Page Freshness
Web page "freshness": Junghoo Cho, Stanford, 2001.
– rose.cs.ucla.edu/~cho/talks/2001/UCLA.ppt
How often do web pages change?
Experimental Setup
February 17 to June 24, 1999.
270 sites visited (with permission):
– identified 400 sites with highest "PageRank"
– contacted administrators
720,000 pages collected:
– 3,000 pages from each site daily
– start at root, visit breadth first (get new & old pages)
– ran only 9pm - 6am, 10 seconds between site requests
Average Change Interval
[Bar chart: fraction of pages (roughly 0.00-0.35) by average change interval, in buckets of up to 1 day, 1 day-1 week, 1 week-1 month, 1 month-4 months, and over 4 months.]
Change Interval -- By Domain
[Bar chart: fraction of pages (up to about 0.6) by average change interval, broken down by domain (com, net, org, edu, gov), using the same interval buckets as the previous slide.]
Refresh Strategy
Crawlers can refresh only a certain number of pages in a period of time.
The page download resource can be allocated in many ways.
The proportional refresh policy allocates the resource proportionally to the pages' change rate.
Proportional Often Not Good!
– Visit fast-changing e1: get 1/2 day of freshness.
– Visit slow-changing e2: get 1/2 week of freshness.
– Visiting e2 is a better deal!
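The arithmetic behind the e1/e2 example can be checked with a small simulation. Assuming a page changes at a fixed interval and we download it at a uniformly random moment within that interval, the copy stays fresh until the next change -- on average half the interval (the daily/weekly intervals for e1 and e2 are the slide's illustrative figures):

```python
import random

def expected_freshness_days(change_interval_days, trials=100_000, seed=0):
    """Monte Carlo estimate of how long one download stays fresh.

    The download happens at a uniformly random point in the page's
    change interval; the copy is fresh for the remaining time until
    the next change, which averages half the interval.
    """
    rng = random.Random(seed)
    total = sum(change_interval_days - rng.uniform(0, change_interval_days)
                for _ in range(trials))
    return total / trials

e1 = expected_freshness_days(1)   # changes daily  -> about 0.5 days fresh
e2 = expected_freshness_days(7)   # changes weekly -> about 3.5 days fresh
# One visit to slow-changing e2 buys roughly seven times the freshness,
# which is why the proportional policy (visit e1 more) can lose.
```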
Comparing Policies
Policy         Freshness   Age
Proportional   0.12        400 days
Uniform        0.57        5.6 days
Optimal        0.62        4.3 days
Based on statistics from the experiment and a revisit frequency of every month.
Not Every Page is Equal!
[Example: page e1 is accessed by users 20 times/day, page e2 only 10 times/day.]
Some pages are "more important".
So weight pages by importance.
Allocating Freshness Resources
– Visit more important pages more frequently.
– Visit pages according to a strategy which optimizes a "freshness metric".
Additional Search Issues
In addition to improved relevance and freshness, can improve overall search with some other factors:
– Eliminate duplicate documents
– Eliminate multiple documents from one site
– Provide good context
– Clearly identify paid links