search engine architecture 1

8/13/2019 Search Engine Architecture 1

1/23

Search Engine Architecture I


2/23

Software Architecture

The high level structure of a softwaresystem

Software components The interfaces provided by those

components

The relationships between thosecomponents

J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 2


3/23

UIMA

An architecture to provide a standard forintegrating search and related languagetechnology components

Unstructured Information ManagementArchitecture (www.research.ibm.com/UIMA)

Defining interfaces for components tosimplify the addition of new technologiesinto systems that handle text and otherunstructured data



4/23


UIMA

http://domino.research.ibm.com/comm/research_projects.nsf/pages/uima.architectureHighlights.html/$FILE/blockDiagram.gif


5/23

A Good Reference



6/23


Primary Goals of Search Engines

Effectiveness (quality): to retrieve the mostrelevant set of documents for a query Process text and store text statistics to improve

relevance

Efficiency (speed): process queries from users asfast as possible Use specialized data structures

Specific goals usually fall into the above primarygoals Example: handling changing document collections

both an effectiveness issue and an efficiency issue


7/23J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 7

Two Major Functions

Search engine components support twomajor functions

The index process: building data structuresthat enable searching

The query process: using those datastructures to produce a ranked list ofdocuments for a users query



The Indexing Process



Text Acquisition

Identifying and making available the documentsthat will be searched

How? Crawling or scanning the web, a corporate intranet, or

other sources of information

Building a document data store containing the text andmetadata for all the documents

Metadata: document type, document structure, documentlength,


10/23



Document Feeds

A mechanism for accessing a real-timestream of documents

RSS: a common standard used for webfeeds for content such as news, blogs, orvideo

An RSS reader subscribes to RSS feeds, andprovides new content when it arrives

RSS feeds are formatted in XML


12/23



Document Data Stores

A simple database to manage large numbers ofdocuments and structured data

Document components are typically stored in acompressed form for efficiency

Structured data consists of document metadataand other information extracted from thedocuments such as links and anchor text

[Discussion] Why do we need document datastores local to search engines?

The original documents are available on the web


14/23




15/23


Text Transformation / Index Creation

Text transformation: transforming documents intoindex terms or features Index terms: the parts of a document that are stored in

the index and used in searching

Features: parts of a text document that is used torepresent its content

Examples: phrases, names, dates, links, Index vocabulary: the set of all the terms that are

indexed for a document collection Index creation: creating the indexes or data

structures Example: building inverted list indexes


16/23


Parsers, Stopping, and Stemming

Processing the sequence of text tokens in thedocument to recognize structural elements suchas titles, figures, links, and headings Tokenization: identifying units to be indexed Using syntax of markup languages to identify structures

Stopping: removing common words from thestream of tokens, e.g., the, of, to, Reducing index size considerably

Stemming: group words that are derived from acommon stem Example: fish, fishes, fishing Increase the likelihood that words used in queries and

documents will match


17/23


Other Text Transformation Tasks

Link extraction and analysis Links can be indexed separately from the general text content Using link analysis algorithms, e.g., PageRank, to quantify page

popularity and find authority pages

Using anchor text to enhance the text content of a page that the linkpoints to

Information extraction: identifying index terms that aremore complex than single words Entity identification, e.g., finding names

Classifiers: identifying class-related metadata fordocuments or parts of documents Example: finding spam documents and non-content parts of

documents (e.g., ads)

Alternatively, clustering related documents


18/23




19/23


Collecting Document Statistics

Gathering and recording statistical informationabout words, features, and documents Statistics will be used to compute scores of documents Stored in lookup tables

Examples Counts of index term occurrences Positions in the documents where the index terms

occurred

Counts of occurrences over groups of documents Lengths of documents in terms of the number of tokens

Actual data depends on the retrieval model andthe associated ranking method


20/23


Weighting

Calculating index term weights using thedocument statistics and storing them in lookuptables Pre-computation can improve query answering

efficiency

TF/IDF weighting TF (the term frequency): the frequency of index term

occurrences in a document

IDF (inverse document frequency): the inverse of thefrequency of index term occurrences in all documents N/n (N: # documents indexed, n: # documentscontaining a particular term)


21/23


Inversion and Index Distribution

Inversion: changing the stream of document-terminformation coming from the text transformationcomponent into term-document information for thecreation of inverted indexes Core of the indexing process The number of documents is large The indexes are updated with new documents from

feeds and crawls, and are often compressed for high

efficiency Index distribution: distributing indexes across

multiple computers/sites on a network Document distribution, term distribution, and replication


22/23


Summary

A high-level description of search enginesoftware architecture

The indexing process Building blocks and their functionalities


23/23

To-Do List

Expand the figures of the indexing processto include the detailed functionalities

Reach Chapter 2.1-2.3.3

J Pei: Information Retrieval and Web Search -- Search Engine Architecture 23

search engine architecture 1

Documents