
Page 1:

Search Engine Internal Processes
Greg Newby @ WTH 2005

Talk slides are available online: http://petascale.org/presentations/wth/wth-search.ppt

Page 2:

Who is this guy?

My academic research is mostly focused on information retrieval (http://petascale.org/vita.html)
I’ve authored a research retrieval system
I’m co-chair of the Global Grid Forum’s working group on Grid Information Retrieval (GIR-WG), working to standardize distributed search

Page 3:

About this presentation

We’re not focusing on how to get good ranking in search engines, or how to search effectively

Instead, we will look at the scientific and practical basis for search engines, from an “information science” point of view

What unanswered questions and open possibilities exist for enhanced information retrieval systems of the future?

Page 4:

Outline: The Science of Search

Basic language; recall and precision
Basic technologies: indexers, spiders and query processors
Query optimization strategies
Spiders/harvesters
Indexers
Where’s the action in search engines?

Page 5:

Basic language

Information retrieval is the science, study, and practice of how humans seek information
Information seeking is complex human behavior, in which some sort of cognitive change is sought

The nature of information is similarly complex. Does it exist apart from a human observer? Why is one person’s “data” another person’s “information?” Can we measure the information content of a message, or is that only for the telephone engineers (like Shannon & Weaver)?

Information retrieval (IR) systems attempt to match information seekers with information

Page 6:

Why IR is hard

For all their performance, modern search engines are often unsatisfying: finding the information you want is difficult.
(Many people are satisfied, even though their results were poor)

IR systems use queries as expressions of information need. But such expressions are necessarily inexact:

Human language is imprecise
Queries are usually short, but might represent complex needs
A person’s history and background will impact which information is useful
A document != information

Page 7:

More on why IR is hard

The language of documents is imprecise.
Documents, or document extracts, or answers, or ... are what an IR system presents.
Long documents have many topics
What is the “meaning” of a document? What is its information content?
Does the document type match the information need type? I.e., answers for questions ... or quick basic information for quick basic information needs ...

Page 8:

Core concept: Relevance

Relevance is the core goal of an IR system
Relevance is multifaceted, and includes:
useful, timely, pertinent, accurate, authoritative, etc. ... as needed for a particular information seeker and her query

For evaluation, we think of relevance as binary: yes/no for a particular document’s relevance to a particular query

In reality, relevance is a difficult topic, and hard to measure accurately

Page 9:

Recall and Precision

How good is an IR system?
Precision is the best measure for search engines: the proportion of retrieved documents that are relevant.

Want perfect precision? Try retrieving just 1 document. If it’s relevant, precision is 100%!

Early high precision is a common approach of IR systems: present some quality documents first, in the hopes they satisfy the information seeker

Page 10:

Recall

Recall is the proportion of relevant documents that are retrieved.

Not so useful for Web search, where there are potentially very many relevant documents

For perfect recall: retrieve all documents, so recall = 100%!

Recall is appropriate for very complete or specialized searches or very small collections

Usually, Web search just looks at precision, especially precision @ some number (p@10)
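To make the two measures concrete, here is a minimal Python sketch (not from the talk; the function names are my own), treating relevance as binary per document, as the slides suggest for evaluation:

```python
# Minimal precision/recall sketch; relevance is binary per document.
def precision(retrieved, relevant):
    """Proportion of retrieved documents that are relevant."""
    return len(set(retrieved) & set(relevant)) / len(retrieved)

def recall(retrieved, relevant):
    """Proportion of relevant documents that are retrieved."""
    return len(set(retrieved) & set(relevant)) / len(relevant)

def precision_at(k, retrieved, relevant):
    """Precision over the first k results, e.g., p@10."""
    return precision(retrieved[:k], relevant)

# Retrieving a single relevant document gives precision 1.0;
# retrieving the whole collection gives recall 1.0.
```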

Page 11:

Anatomy of a Web search engine

Harvesters (or spiders, or collection managers): these gather documents and prepare them for indexing

Indexers: the core IR system, that represents a collection for rapid retrieval

Query processors: front end to the index, to retrieve documents

Page 12:

Harvesters: to gather input

Lots of challenges:
different document types
duplicate documents; finding authoritative/master sites
different languages
dynamic content
firewalls, passwords
invalid HTML; frames
not overloading harvested sites
dealing with site requests for non-indexing
bandwidth (to sites; to indexer)
harvest schedule; retiring removed or inaccessible documents

Page 13:

Harvesters, continued

Harvesting is complex, and largely orthogonal to the rest of the IR system (i.e., the IR system doesn’t really care how difficult it was to get the documents ... it just indexes and retrieves them!)

Utilities such as htdig, wget and curl can be used for basic indexing, but more complete indexing is challenging
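As a rough illustration of the “not overloading sites” and “site requests for non-indexing” points, a minimal polite-fetch sketch in Python (my own example, standard library only; the user-agent string is a hypothetical placeholder):

```python
import time
import urllib.request
from urllib import robotparser

USER_AGENT = "example-harvester/0.1"  # hypothetical user-agent

def fetch_politely(urls, robots_url, delay=1.0):
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)            # e.g., a site's /robots.txt
    rp.read()
    pages = {}
    for url in urls:
        if not rp.can_fetch(USER_AGENT, url):
            continue                  # honor the site's non-indexing request
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req) as resp:
            pages[url] = resp.read()
        time.sleep(delay)             # avoid overloading the harvested site
    return pages
```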

Page 14:

Indexers: the core IR system

Similar concepts and practice to a DBMS
Terms and documents are usually assigned ID #s (for fixed-length fields); information about term frequency and position is kept, as well as weights for terms in documents.
Better query terms (more unique; better at distinguishing among documents) get higher weights in documents
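A toy sketch of the structures just described (my own illustration, not the talk’s code): terms get integer IDs, and per-document term frequencies and positions are kept in posting lists:

```python
from collections import defaultdict

term_ids = {}                    # term -> term ID (fixed-length key)
postings = defaultdict(list)     # term ID -> [(doc ID, tf, positions)]

def index_document(doc_id, text):
    # Collect the positions of each word in this document.
    positions = defaultdict(list)
    for pos, word in enumerate(text.lower().split()):
        positions[word].append(pos)
    # Record (doc ID, term frequency, positions) per term.
    for word, pos_list in positions.items():
        tid = term_ids.setdefault(word, len(term_ids))
        postings[tid].append((doc_id, len(pos_list), pos_list))

index_document(0, "search engines index documents")
index_document(1, "documents about searching")
```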

Page 15:

Indexers, continued

Documents are also weighted: better documents should be ranked more highly. Google’s PageRank is one way of measuring document quality, based on site authoritativeness

The challenge for indexers is to represent documents quickly and efficiently (input), but more importantly to enable rapid querying (output)
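The slides don’t give PageRank’s details; for flavor, a simplified power-iteration sketch (my own, with a hypothetical toy link graph, and without the dangling-node handling a real implementation needs):

```python
def pagerank(links, damping=0.85, iters=50):
    """links: node -> list of outlinked nodes (every node has outlinks)."""
    n = len(links)
    rank = {u: 1.0 / n for u in links}
    for _ in range(iters):
        new = {u: (1 - damping) / n for u in links}
        for u, outs in links.items():
            share = damping * rank[u] / len(outs)
            for v in outs:
                new[v] += share      # u passes authority to pages it links to
        rank = new
    return rank

toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(toy_graph))
```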

Page 16:

IR interaction

A user sends a query
Conceptually, all documents in the/each collection are evaluated against the query for relevance, based on a formula
(In fact, only a small subset need to be ranked. More in a minute ...)

The top-ranked documents are presented to the user. If some of those documents are relevant, the search engine did a good job!
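In code, the interaction loop might look like this minimal sketch (mine, not the talk’s); score() stands in for any ranking formula, such as the cosine measure described later:

```python
import heapq

def top_ranked(query, candidate_docs, score, k=10):
    # Rank only the candidate subset (not the whole collection)
    # and keep the k highest-scoring documents for presentation.
    return heapq.nlargest(k, candidate_docs,
                          key=lambda doc: score(query, doc))
```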

Page 17:

Query shortcuts IR systems can take

Table lookup and caching: if a query matches a known query, just return the prior results from a table/database – no need to run the query. Yes, this can be used to “hand tune” query results (i.e., human optimizers)

Algebra shortcuts: most forms of ranking only look at the occurrence of query terms. So, any document without any/all query terms is automatically considered non-relevant
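A sketch of both shortcuts in Python (my own illustration; here `postings` maps each term to the set of documents containing it, and `rank` is any ranking function):

```python
query_cache = {}   # table lookup: query -> previously computed results

def candidate_set(query_terms, postings):
    # Algebra shortcut: only documents containing at least one
    # query term can score above zero, so score nothing else.
    docs = set()
    for term in query_terms:
        docs |= postings.get(term, set())
    return docs

def run_query(query_terms, postings, rank):
    key = tuple(sorted(query_terms))
    if key not in query_cache:       # cache hit: skip ranking entirely
        query_cache[key] = rank(candidate_set(query_terms, postings))
    return query_cache[key]
```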

Page 18:

Sample Basic Equations of IR

tf is the weighted frequency of a term in a document (“term frequency”), for term i in document j

idf is the inverse of the term’s weight across the collection (“inverse document frequency”), for term i

The weighted relevance score of a query term i in a document j is: tf * idf
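In symbols (the slides don’t give an exact idf definition; the log(N/n_i) form below is one standard choice):

$$w_{i,j} = tf_{i,j} \cdot idf_i, \qquad idf_i = \log\frac{N}{n_i}$$

where $N$ is the number of documents in the collection and $n_i$ is the number of documents containing term $i$.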

Page 19:

A Relevance Score

A query has t terms (i.e., words). To get a relevance score for an entire document j, we treat the query and document as vectors, normalize (take the vector norm) and compute the cosine (which is the dot product of the normalized vectors)

Cosine ranges from 0 to 1; 0 is orthogonal (“unrelated”), 1 is a perfect match. Rank in descending order and present results
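Written out, this is the standard vector-space cosine measure:

$$\cos(\vec{q}, \vec{d}_j) = \frac{\vec{q} \cdot \vec{d}_j}{\|\vec{q}\|\,\|\vec{d}_j\|} = \frac{\sum_{i=1}^{t} w_{i,q}\, w_{i,j}}{\sqrt{\sum_{i=1}^{t} w_{i,q}^2}\,\sqrt{\sum_{i=1}^{t} w_{i,j}^2}}$$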

Page 20:

Numerically ...

Imagine a two-word query
Document d1 has a weighted score of 1 for term 1, and 2 for term 2: vector [1, 2]
Query terms are weighted vector [1, 3]
We first normalize the document to vector [.45, .89], and the query to vector [.32, .95]
Then get cosine ≈ .99
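Checking the arithmetic with a few lines of Python (mine, standard library only):

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

doc, query = [1, 2], [1, 3]
d, q = normalize(doc), normalize(query)    # [0.45, 0.89], [0.32, 0.95]
cosine = sum(a * b for a, b in zip(d, q))  # dot product of unit vectors
print(round(cosine, 2))                    # 0.99
```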

Page 21:

Real-world Ranking is more Complex

Sample for “LNU.LTC” weighting after Salton & Buckley (1988) or http://www.cs.mu.oz.au/~aht/SMART2Q.html

LNU is the term-in-document weight:

$$\mathrm{lnu}_{i,j} = \frac{\log(tf_{i,j}) + 1}{\sqrt{\sum_{k=1}^{t} \left(\log(tf_{k,j}) + 1\right)^2}}$$

LTC is the term-in-collection weight:

$$\mathrm{ltc}_{i} = \frac{2 \cdot idf_i}{\sqrt{\sum_{k=1}^{t} \left(2 \cdot idf_k\right)^2}}$$
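A literal transcription of the two formulas above into Python (my own sketch, taking the slide’s equations at face value rather than the full SMART weighting definitions):

```python
import math

def lnu_weights(tf):
    """tf: term -> raw count in one document; returns normalized weights."""
    raw = {t: math.log(c) + 1 for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

def ltc_weights(idf):
    """idf: term -> collection idf; returns normalized query-side weights."""
    raw = {t: 2 * v for t, v in idf.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

def score(doc_tf, idf):
    # Sum of document-weight x collection-weight over shared terms.
    d, q = lnu_weights(doc_tf), ltc_weights(idf)
    return sum(d[t] * q[t] for t in d if t in q)
```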

Page 22:

There are many variations ...

Simple Boolean retrieval
Probabilistic retrieval
Latent semantic indexing or information space (uses eigensystems to measure term relations, rather than assume all terms are orthogonal)
Semantic-based representation (i.e., part of speech, document structure)
Link-based techniques (for HTML: use inlinks and outlinks to identify document topic & relations)
Many HTML tricks: use META, H1/title, etc.

But the fundamental processes of weighting and ranking almost always apply

Page 23:

How can Web search engines be really fast, with huge collections?

Query optimization
Fast building of candidate response set; fast ranking of results
This is mostly about engineering: parallelizing search; storing data in memory; fast disk structures ...
More engineering: handle simultaneous harvesting/updating, concurrent queries, and subset queries (i.e., local search or search within particular sites or domains)

Page 24:

How can Web search engines be really good?

Be really fast
Have effective measures for weights of terms, weights of documents, and other factors:
site quality/authority; duplicate removal; currentness ...
query spelling check; good HTML parsing; non-HTML document representation

Have features for better utility & usability

Page 25:

What do Information Scientists know about IR that could be useful?

A lot about human information seeking
Depth on the topic of relevance
Aspects of documents (i.e., different types) and queries (i.e., different needs)
Techniques for:
personalized search (training; standing queries)
searching streams of documents with standing queries (filtering)
multi-lingual search; cross-language IR

Page 26:

For Further Study

The Text Retrieval Conference (TREC) at http://trec.nist.gov; also DARPA’s TIDES program, and several international conferences/competitions such as CLEF

CiteSeer: there’s a lot of literature
Try “information retrieval” in Google – read Robertson’s book

Go forth and search: better, and more informed! There are lots of opportunities to improve IR performance: it is not a solved problem
