CSC 9010: Search Engines: Google
Dr. Paula Matuszek
Paula_A_Matuszek@glaxosmithkline.com
(610) 270-6851
©2003 Paula Matuszek



Page 1: CSC 9010: Search Engines: Google (title slide)

Page 2: Search Engine Basics
- A spider or crawler starts at a web page, identifies all links on it, and follows them to new web pages.
- A parser processes each web page and extracts individual words.
- An indexer creates/updates a hash table which connects words with documents.
- A searcher uses the hash table to retrieve documents based on words.
- A ranking system decides the order in which to present the documents: their relevance.

Page 3: Selecting Relevant Documents
- Assume:
  - we already have a corpus of documents defined
  - the goal is to return a subset of those documents
  - individual documents have been separated into individual files
- The remaining components must parse, index, find, and rank documents.
- The traditional approach is based on the words in the documents (it predates the web).

Page 4: Extracting Lexical Features
- Process a string of characters:
  - assemble characters into tokens (tokenizer)
  - choose tokens to index
- Done in place (a problem for the WWW)
- A standard lexical analysis problem
- Can use a lexical analyser generator, such as lex

Page 5: Lexical Analyser
- Basic idea is a finite state machine
- Triples of (input state, transition token, output state)
- Must be very efficient; it gets used a LOT

[Figure: a three-state machine with states 0, 1, 2 and transitions labeled blank, A-Z, and "blank, EOF".]
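As a rough sketch, a state machine like the one above can be written as a table-driven loop. The behavior here is my reading of the diagram (skip blanks, accumulate letters, emit a token on any non-letter), not code from the slides.

```python
# Minimal sketch of a finite-state tokenizer in the spirit of the
# three-state machine on the slide. State 0: between tokens (skip
# non-letters); state 1: inside a token (accumulate letters); the
# diagram's state 2 corresponds to the point where a token is emitted.
def tokenize(text: str) -> list[str]:
    tokens, current, state = [], [], 0
    for ch in text + " ":               # trailing blank stands in for EOF
        if state == 0:
            if ch.isalpha():
                current.append(ch)
                state = 1
        else:  # state == 1
            if ch.isalpha():
                current.append(ch)
            else:                       # non-letter ends the token (state 2)
                tokens.append("".join(current))
                current, state = [], 0
    return tokens
```

For example, `tokenize("the quick brown")` yields the three word tokens, with the blanks consumed by state 0.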

Page 6: Design Issues for Lexical Analyser
- Punctuation
  - treat as whitespace?
  - treat as characters?
  - treat specially?
- Case
  - fold?
- Digits
  - assemble into numbers?
  - treat as characters?
  - treat as punctuation?

Page 7: Lexical Analyser
- Output of the lexical analyser is a string of tokens
- Remaining operations are all on these tokens
- We have already thrown away some information; this makes processing more efficient, but limits the power of our search

Page 8: Stemming
- Additional processing at the token level
- Turn words into a canonical form:
  - "cars" into "car"
  - "children" into "child"
  - "walked" into "walk"
- Decreases the total number of different tokens to be processed
- Decreases the precision of a search, but increases its recall

Page 9: Stemming -- How?
- Plurals to singulars (e.g., children to child)
- Verbs to infinitives (e.g., talked to talk)
- Clearly non-trivial in English!
- Typical stemmers use a context-sensitive transformation grammar:
  - (.*)SSES -> \1SS
- 50-250 rules are typical
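A transformation grammar of this kind can be sketched as an ordered list of regex rules. The rules below are illustrative only (a real stemmer such as Porter's has far more rules plus conditions on when they fire); only the SSES rule comes from the slide.

```python
# Tiny context-sensitive transformation grammar in the style of the
# slide's "(.*)SSES -> \1SS" rule. Rules are tried in order and the
# first match wins; rule set and ordering are illustrative only.
import re

RULES = [
    (r"(.*)sses$", r"\1ss"),     # caresses -> caress (from the slide)
    (r"(.*[a-z])ies$", r"\1y"),  # ponies -> pony (crude, assumed rule)
    (r"(.*[a-z])s$", r"\1"),     # cars -> car (assumed rule)
    (r"(.*[a-z])ed$", r"\1"),    # walked -> walk (assumed rule)
]

def stem(word: str) -> str:
    for pattern, replacement in RULES:
        new = re.sub(pattern, replacement, word)
        if new != word:
            return new           # apply only the first matching rule
    return word
```

Irregular forms like "children" -> "child" would need their own rules or an exception list, which is part of why 50-250 rules are typical.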

Page 10: Noise Words (Stop Words)
- Function words that contribute little or nothing to meaning
- Very frequent words
  - If a word occurs in every document, it is not useful in choosing among documents
  - However, need to be careful, because this is corpus-dependent
- Often implemented as a discrete list
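A discrete stop-word list amounts to a set-membership filter, along these lines; the word list here is a tiny illustrative sample, and as the slide notes, a real list would be corpus- and language-dependent.

```python
# Stop-word removal as a discrete list, per the slide. The list below
# is an invented sample; production lists are corpus-dependent.
STOP_WORDS = {"a", "an", "the", "of", "in", "to", "and", "is"}

def remove_stop_words(tokens: list[str]) -> list[str]:
    # keep only tokens that are not on the stop list (case-folded)
    return [t for t in tokens if t.lower() not in STOP_WORDS]
```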

Page 11: Example Corpora
- We are assuming a fixed corpus. Some sample corpora:
  - Medline abstracts
  - Email. Anyone's email.
  - Reuters corpus
  - Brown corpus
- Textual fields, structured attributes
  - Textual: free, unformatted, no meta-information
  - Structured: additional information beyond the content

Page 12: Structured Attributes for Medline
- Pubmed ID
- Author
- Year
- Keywords
- Journal

Page 13: Textual Fields for Medline
- Abstract
  - Reasonably complete standard academic English
  - Captures the basic meaning of the document
- Title
  - Short, formalized
  - Captures the most critical part of the meaning
  - Proxy for the abstract

Page 14: Structured Fields for Email
- To, From, Cc, Bcc
- Dates
- Content type
- Status
- Content length
- Subject (partially)

Page 15: Text Fields for Email
- Subject
  - Format is structured, content is arbitrary.
  - Captures the most critical part of the content.
  - Proxy for the content -- but may be inaccurate.
- Body of email
  - Highly irregular, informal English.
  - Entire document, not a summary.
  - Spelling and grammar irregularities.
  - Structure and length vary.

Page 16: Indexing
- We have a tokenized, stemmed sequence of words
- The next step is to parse each document, extracting index terms
  - Assume that each token is a word and we don't want to recognize any more complex structures than single words
- When all documents are processed, create the index

Page 17: Basic Indexing Algorithm
- For each document in the corpus:
  - get the next token
  - save the posting in a list
    - doc ID, frequency
- For each token found in the corpus:
  - calculate #docs, total frequency
  - sort by frequency
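The algorithm above can be sketched as an inverted-index builder. This is a minimal illustration: documents are assumed to arrive as pre-tokenized, stemmed word lists, and the data-structure names are my own.

```python
# Sketch of the basic indexing algorithm: collect (docID, frequency)
# postings per token, then summarize each token by (#docs, total
# frequency), sorted by total frequency.
from collections import Counter

def build_index(corpus: dict[str, list[str]]) -> dict[str, list[tuple[str, int]]]:
    index: dict[str, list[tuple[str, int]]] = {}
    for doc_id, tokens in corpus.items():
        for token, freq in Counter(tokens).items():
            index.setdefault(token, []).append((doc_id, freq))  # one posting
    return index

def summarize(index: dict[str, list[tuple[str, int]]]):
    # per-token (#docs, total frequency), most frequent tokens first
    stats = {t: (len(p), sum(f for _, f in p)) for t, p in index.items()}
    return sorted(stats.items(), key=lambda kv: -kv[1][1])
```

For a corpus `{"d1": ["cat", "cat", "dog"], "d2": ["dog", "cat"]}`, the posting list for "cat" is `[("d1", 2), ("d2", 1)]` and the summary ranks "cat" (total frequency 3) above "dog" (total frequency 2).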

Page 18: Fine Points
- Dynamic corpora require incremental algorithms
- Higher-resolution data (e.g., character position)
- Giving extra weight to proxy text (typically by doubling or tripling the frequency count)
- Document-type-specific processing
  - In HTML, want to ignore tags
  - In email, maybe want to ignore quoted material

Page 19: Choosing Keywords
- Don't necessarily want to index on every word
  - Takes more space for the index
  - Takes more processing time
  - May not improve our resolving power
- How do we choose keywords?
  - Manually
  - Statistically
- Exhaustivity vs. specificity

Page 20: Manually Choosing Keywords
- Unconstrained vocabulary: allow the creator of the document to choose whatever he/she wants
  - "best" match
  - captures new terms easily
  - easiest for the person choosing keywords
- Constrained vocabulary: hand-crafted ontologies
  - can include hierarchical and other relations
  - more consistent
  - easier for searching; possible "magic bullet" search

Page 21: Examples of Constrained Vocabularies
- ACM headings (www.acm.org/class/1998)
  - H: Information Retrieval
    - H3: Information Storage and Retrieval
      - H3.3: Information Search and Retrieval
        - Clustering, Query formulation, Relevance feedback, Search process, etc.
- Medline headings (www.nlm.nih.gov/mesh/meshhome.html)
  - L: Information Science
    - L01: Information Science
      - L01.700: Medical Informatics
        - L01.700.508: Medical Informatics Applications
          - L01.700.508.280: Information Storage and Retrieval
            - Grateful Med [L01.700.508.280.400]

Page 22: Automated Vocabulary Selection
- Frequency: Zipf's Law
  - Pn = 1/n^a, where Pn is the frequency of occurrence of the nth-ranked item and a is close to 1
  - Within one corpus, words with middle frequencies are typically "best"
- Document-oriented representation bias: lots of keywords per document
- Query-oriented representation bias: only the "most typical" words. Assumes that we are comparing across documents.
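Zipf's law can be turned into a quick sanity check: with a close to 1, the nth-ranked word should occur roughly 1/n as often as the top-ranked word. A small sketch (the top frequency below is invented for illustration):

```python
# Zipf's law: Pn proportional to 1/n^a with a close to 1, so the
# frequency of the nth-ranked word falls off roughly as 1/n.
def zipf_predicted(rank: int, top_frequency: float, a: float = 1.0) -> float:
    """Predicted frequency of the word at the given rank."""
    return top_frequency / rank ** a

# With a = 1 and a top word occurring 1000 times, the 2nd-ranked word
# is predicted to occur about 500 times, the 4th about 250 times.
```

This is why middle-frequency words tend to be the "best" keywords: the handful of top-ranked words dominate every document, while the long tail occurs too rarely to be useful.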

Page 23: Choosing Keywords
- "Best" depends on actual use; if a word only occurs in one document, it may be very good for retrieving that document, but it is not very effective overall.
- Words which have no resolving power within a corpus may be the best choices across corpora.
- Not very important for web searching; will be more relevant for text mining.

Page 24: Keyword Choice for WWW
- We don't have a fixed corpus of documents
- New terms appear fairly regularly, and are likely to be common search terms
- Queries that people want to make are wide-ranging and unpredictable
- Therefore: can't limit keywords, except possibly to eliminate stop words
- Even stop words are language-dependent, so determine the language first

Page 25: Comparing and Ranking Documents
- Once our search engine has retrieved a set of documents, we may want to:
- Rank them by relevance
  - Which are the best fit to my query?
  - This involves determining what the query is about and how well the document answers it
- Compare them
  - Show me more like this.
  - This involves determining what the document is about.

Page 26: Determining Relevance by Keyword
- The typical web query consists entirely of keywords.
- Retrieval can be binary: present or absent.
- More sophisticated is to look for degree of relatedness: how much does this document reflect what the query is about?
- Simple strategies:
  - How many times does the word occur in the document?
  - How close to the head of the document?
  - If multiple keywords, how close together?

Page 27: Keywords for Relevance Ranking
- Count: repetition is an indication of emphasis
  - Very fast (usually in the index)
  - A reasonable heuristic
  - Unduly influenced by document length
  - Can be "stuffed" by web designers
- Position: lead paragraphs summarize content
  - Requires more computation
  - Also a reasonable heuristic
  - Less influenced by document length
  - Harder to "stuff"; can only have a few keywords near the beginning

Page 28: Keywords for Relevance Ranking
- Proximity for multiple keywords
  - Requires even more computation
  - Obviously relevant only if we have multiple keywords
  - Effectiveness of the heuristic varies with the information need; typically either excellent or not very helpful at all
  - Very hard to "stuff"
- All keyword methods
  - Are computationally simple and adequately fast
  - Are effective heuristics
  - Typically perform as well as in-depth natural language methods for standard search
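The count and position heuristics above can be combined into a toy scoring function. The normalization and weights here are arbitrary illustrations (any real engine tunes these), not anything from the slides.

```python
# Toy relevance score combining two keyword heuristics from the
# slides: a length-normalized occurrence count plus a bonus for an
# early first occurrence. The combination and weights are invented.
def keyword_score(keyword: str, tokens: list[str]) -> float:
    positions = [i for i, t in enumerate(tokens) if t == keyword]
    if not positions:
        return 0.0
    count_score = len(positions) / len(tokens)   # count, normalized by length
    position_score = 1.0 / (1 + positions[0])    # earlier first hit is better
    return count_score + position_score
```

Note how length normalization addresses the "unduly influenced by document length" weakness, and the first-position term is harder to stuff than raw counts.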

Page 29: Comparing Documents
- "Find me more like this one" really means that we are using the document as a query.
- This requires that we have some conception of what a document is about overall.
- Depends on the context of the query. We need to:
  - Characterize the entire content of this document
  - Discriminate between this document and others in the corpus

Page 30: Characterizing a Document: Term Frequency
- A document can be treated as a sequence of words.
- Each word characterizes that document to some extent.
- When we have eliminated stop words, the most frequent words tend to be what the document is about.
- Therefore: f_kd (the number of occurrences of word k in document d) will be an important measure. Also called the term frequency.

Page 31: Characterizing a Document: Document Frequency
- What makes this document distinct from others in the corpus?
- The terms which discriminate best are not those which occur with high frequency!
- Therefore: D_k (the number of documents in which word k occurs) will also be an important measure. Also called the document frequency.

Page 32: TF*IDF
- This can all be summarized as: words are the best discriminators when they
  - occur often in this document (term frequency)
  - don't occur in a lot of documents (document frequency)
- One very common measure of the importance of a word to a document is TF*IDF: term frequency * inverse document frequency
- There are multiple formulas for actually computing this. The underlying concept is the same in all of them.
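One common formulation (the slide notes several exist) multiplies the term frequency by the log of the inverse document fraction, using the f_kd and D_k measures from the previous two slides:

```python
# One standard TF*IDF variant: tf-idf(k, d) = f_kd * log(N / D_k),
# where f_kd is the term frequency of word k in document d, N is the
# number of documents in the corpus, and D_k is the number of
# documents containing k. Other variants differ in scaling details.
import math

def tf_idf(term_freq: int, num_docs: int, doc_freq: int) -> float:
    return term_freq * math.log(num_docs / doc_freq)
```

A word occurring 5 times in a document but present in all 100 corpus documents scores log(1) = 0; the same count for a word found in only 10 documents scores much higher, which is exactly the "occurs often here, rarely elsewhere" intuition above.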

Page 33: Describing an Entire Document
- So what is a document about?
- TF*IDF: can simply list keywords in order of their TF*IDF values
- The document is about all of them to some degree: it is at some point in some vector space of meaning

Page 34: Vector Space
- Any corpus has a defined set of terms (the index)
- These terms define a knowledge space
- Every document is somewhere in that knowledge space -- it is or is not about each of those terms
- Consider each term as a vector. Then:
  - We have an n-dimensional vector space
  - Where n is the number of terms (very large!)
  - Each document is a point in that vector space
- The document's position in this vector space can be treated as what the document is about.

Page 35: Similarity Between Documents
- How similar are two documents?
- Measures of association
  - How much do the feature sets overlap?
  - Modified for length: DICE coefficient
    - DICE(x,y) = 2 f(x,y) / ( f(x) + f(y) ) -- the size of the shared term set relative to the sizes of the two term sets
  - Simple matching coefficient: takes exclusions into account
- Cosine similarity
  - similarity of the angle between the two document vectors
  - not sensitive to vector length

Page 36: Bag of Words
- All of these techniques are what is known as bag-of-words approaches.
- Keywords are treated in isolation.
- The difference between "man bites dog" and "dog bites man" is non-existent.
- Later we will discuss linguistic approaches which pay attention to semantics.

Page 37: GOOGLE
- Web-based search engine
- Doesn't have a predefined corpus
- Has a very wide variety of documents
- Queries are typically short and simple
- Not concerned with document similarity
- VERY concerned with relevance

Page 38: Google Goals
- Relevance
  - Techniques derived from Information Retrieval are focused on "Is this about my query?"
  - BUT: web content is largely unfiltered and unrefereed
  - So relevance should also focus on "Is this a good web page?"
- Scalability
- Improved search quality
- Academic research tool

Page 39: Determining Relevance
- PageRank
- Anchor text (also allows indexing of documents not spidered, such as images)
- Proximity
- Font characteristics

Page 40: Page Rank
- Unique characteristic of GOOGLE
- Based on the citation graph of the web
- Focus is on the importance of the page, not on about-ness
  - NO: Is this what I asked about?
  - YES: How good is this page?
- Still: How well does this meet my information need?

Page 41: Simple PageRank
- The model is a surfer randomly clicking through the web; a page's rank is the probability that the surfer will reach that specific page.
- Given by:

  PR(i) = sum over j in B(i) of PR(j) / N(j)

  where B(i) is the set of pages that link to i, and N(j) is the number of outgoing links from j.

Page 42: Page Rank, cont.
- Can get "bogged down" in highly inter-connected pages, so add a "damping factor"
- "Bored surfer" model -- the surfer gets bored and moves to a random new page :-)

Page 43: Google Architecture
[Architecture diagram not captured in the transcript.]

Page 44: Architecture: Modules
- Crawlers (distributed)
- URL server
- Store server
- Indexer
- Sorter
- URL resolver
- DumpLexicon
- Searcher

Page 45: Architecture: Data Stores
- Repository: compressed documents, with URL, docID, length
- Barrels: for a range of wordIDs, hold lists of wordIDs with hit lists and docIDs
- Anchor file: link information, with source, target, and anchor text
- Link database: pairs of docIDs
- Document index: indexed by docID; contains status and a pointer into the repository
- Lexicon: list of words; hash table of pointers to barrels

Page 46: GOOGLE API
- Search requests: submit a query string and a set of parameters to the Google Web APIs service and receive in return a set of search results
- Cache requests: submit a URL to the Google Web APIs service and receive in return the contents of the URL when Google's crawlers last visited the page
- Spelling requests: submit a query to the Google Web APIs service and receive in return a suggested spelling correction for the query
- C:\Villanova\Villanova Fall 2003\google api\googleapi\APIs_Reference.html

Page 47: Web Page Freshness
- Web page "freshness": Junghoo Cho, Stanford, 2001
  - rose.cs.ucla.edu/~cho/talks/2001/UCLA.ppt
- How often do web pages change?

Page 48: Experimental Setup
- February 17 to June 24, 1999
- 270 sites visited (with permission)
  - identified 400 sites with highest "PageRank"
  - contacted administrators
- 720,000 pages collected
  - 3,000 pages from each site daily
  - start at root, visit breadth first (get new & old pages)
  - ran only 9pm - 6am, 10 seconds between site requests
- rose.cs.ucla.edu/~cho/talks/2001/UCLA.ppt

Page 49: Average Change Interval
[Bar chart: fraction of pages (0.00 to 0.35) by average change interval, binned as <1 day, 1 day to 1 week, 1 week to 1 month, 1 month to 4 months, and >4 months.]
- rose.cs.ucla.edu/~cho/talks/2001/UCLA.ppt

Page 50: Change Interval -- By Domain
[Bar chart: fraction of pages (0 to 0.6) by average change interval, same bins as the previous slide, broken out by domain: com, net, org, edu, gov.]
- rose.cs.ucla.edu/~cho/talks/2001/UCLA.ppt

Page 51: Refresh Strategy
- Crawlers can refresh only a certain number of pages in a period of time.
- The page download resource can be allocated in many ways.
- The proportional refresh policy allocates the resource proportionally to the pages' change rates.
- rose.cs.ucla.edu/~cho/talks/2001/UCLA.ppt

Page 52: Proportional Often Not Good!
- Visit fast-changing e1: get 1/2 day of freshness
- Visit slow-changing e2: get 1/2 week of freshness
- Visiting e2 is a better deal!
- rose.cs.ucla.edu/~cho/talks/2001/UCLA.ppt
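The arithmetic behind the slide can be made explicit. Assuming (my reading of "fast" and "slow") that e1 changes about once a day and e2 about once a week, and that a visit lands at a random point in the change interval, a downloaded copy stays fresh for about half the interval on average:

```python
# Expected freshness bought by one crawler visit: if a page changes
# every `change_interval_days` days and the visit lands uniformly at
# random within that interval, the copy stays fresh for half the
# interval on average. Intervals below are assumed, not measured.
def freshness_per_visit(change_interval_days: float) -> float:
    return change_interval_days / 2.0

e1 = freshness_per_visit(1.0)  # fast-changing page: ~1/2 day of freshness
e2 = freshness_per_visit(7.0)  # slow-changing page: ~1/2 week (3.5 days)
```

So each visit spent on e2 buys seven times as much freshness-time as a visit spent on e1, which is why the proportional policy (which spends most visits on e1) can lose to a more uniform one.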

Page 53: Comparing Policies

  Policy        Freshness   Age
  Proportional  0.12        400 days
  Uniform       0.57        5.6 days
  Optimal       0.62        4.3 days

- Based on statistics from the experiment and a revisit frequency of every month
- rose.cs.ucla.edu/~cho/talks/2001/UCLA.ppt

Page 54: Not Every Page is Equal!
- e1: accessed by users 20 times/day
- e2: accessed by users 10 times/day
- Some pages are "more important"
- So weight pages by importance.
- rose.cs.ucla.edu/~cho/talks/2001/UCLA.ppt

Page 55: Allocating Freshness Resources
- Visit more important pages more frequently
- Visit pages according to a strategy which optimizes a "freshness metric"

Page 56: ©2003 Paula Matuszek CSC 9010: Search Engines Google Dr. Paula Matuszek (610) 270-6851

©2003 Paula Matuszek

Additional Search Issues In addition to improved relevance and

freshness, can improve overall search with some other factors:– Eliminate duplicate documents– Eliminate multiple documents from one site– Provide good context– Clearly identify paid links