search engine architecture 1

Upload: aadafull

Post on 04-Jun-2018

220 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/13/2019 Search Engine Architecture 1

    1/23

    Search Engine Architecture I

  • 8/13/2019 Search Engine Architecture 1

    2/23

    Software Architecture

    The high level structure of a softwaresystem

    Software components The interfaces provided by those

    components

    The relationships between thosecomponents

    J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 2

  • 8/13/2019 Search Engine Architecture 1

    3/23

    UIMA

    An architecture to provide a standard forintegrating search and related languagetechnology components

    Unstructured Information ManagementArchitecture (www.research.ibm.com/UIMA)

    Defining interfaces for components tosimplify the addition of new technologiesinto systems that handle text and otherunstructured data

    J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 3

  • 8/13/2019 Search Engine Architecture 1

    4/23

    J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 4

    UIMA

    http://domino.research.ibm.com/comm/research_projects.nsf/pages/uima.architectureHighlights.html/$FILE/blockDiagram.gif

  • 8/13/2019 Search Engine Architecture 1

    5/23

    A Good Reference

    J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 5

  • 8/13/2019 Search Engine Architecture 1

    6/23

    J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 6

    Primary Goals of Search Engines

    Effectiveness (quality): to retrieve the mostrelevant set of documents for a query Process text and store text statistics to improve

    relevance

    Efficiency (speed): process queries from users asfast as possible Use specialized data structures

    Specific goals usually fall into the above primarygoals Example: handling changing document collections

    both an effectiveness issue and an efficiency issue

  • 8/13/2019 Search Engine Architecture 1

    7/23J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 7

    Two Major Functions

    Search engine components support twomajor functions

    The index process: building data structuresthat enable searching

    The query process: using those datastructures to produce a ranked list ofdocuments for a users query

  • 8/13/2019 Search Engine Architecture 1

    8/23J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 8

    The Indexing Process

  • 8/13/2019 Search Engine Architecture 1

    9/23J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 9

    Text Acquisition

    Identifying and making available the documentsthat will be searched

    How? Crawling or scanning the web, a corporate intranet, or

    other sources of information

    Building a document data store containing the text andmetadata for all the documents

    Metadata: document type, document structure, documentlength,

  • 8/13/2019 Search Engine Architecture 1

    10/23

  • 8/13/2019 Search Engine Architecture 1

    11/23J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 11

    Document Feeds

    A mechanism for accessing a real-timestream of documents

    RSS: a common standard used for webfeeds for content such as news, blogs, orvideo

    An RSS reader subscribes to RSS feeds, andprovides new content when it arrives

    RSS feeds are formatted in XML

  • 8/13/2019 Search Engine Architecture 1

    12/23

  • 8/13/2019 Search Engine Architecture 1

    13/23J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 13

    Document Data Stores

    A simple database to manage large numbers ofdocuments and structured data

    Document components are typically stored in acompressed form for efficiency

    Structured data consists of document metadataand other information extracted from thedocuments such as links and anchor text

    [Discussion] Why do we need document datastores local to search engines?

    The original documents are available on the web

  • 8/13/2019 Search Engine Architecture 1

    14/23

    J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 14

    The Indexing Process

  • 8/13/2019 Search Engine Architecture 1

    15/23

    J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 15

    Text Transformation / Index Creation

    Text transformation: transforming documents intoindex terms or features Index terms: the parts of a document that are stored in

    the index and used in searching

    Features: parts of a text document that is used torepresent its content

    Examples: phrases, names, dates, links, Index vocabulary: the set of all the terms that are

    indexed for a document collection Index creation: creating the indexes or data

    structures Example: building inverted list indexes

  • 8/13/2019 Search Engine Architecture 1

    16/23

    J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 16

    Parsers, Stopping, and Stemming

    Processing the sequence of text tokens in thedocument to recognize structural elements suchas titles, figures, links, and headings Tokenization: identifying units to be indexed Using syntax of markup languages to identify structures

    Stopping: removing common words from thestream of tokens, e.g., the, of, to, Reducing index size considerably

    Stemming: group words that are derived from acommon stem Example: fish, fishes, fishing Increase the likelihood that words used in queries and

    documents will match

  • 8/13/2019 Search Engine Architecture 1

    17/23

    J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 17

    Other Text Transformation Tasks

    Link extraction and analysis Links can be indexed separately from the general text content Using link analysis algorithms, e.g., PageRank, to quantify page

    popularity and find authority pages

    Using anchor text to enhance the text content of a page that the linkpoints to

    Information extraction: identifying index terms that aremore complex than single words Entity identification, e.g., finding names

    Classifiers: identifying class-related metadata fordocuments or parts of documents Example: finding spam documents and non-content parts of

    documents (e.g., ads)

    Alternatively, clustering related documents

  • 8/13/2019 Search Engine Architecture 1

    18/23

    J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 18

    The Indexing Process

  • 8/13/2019 Search Engine Architecture 1

    19/23

    J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 19

    Collecting Document Statistics

    Gathering and recording statistical informationabout words, features, and documents Statistics will be used to compute scores of documents Stored in lookup tables

    Examples Counts of index term occurrences Positions in the documents where the index terms

    occurred

    Counts of occurrences over groups of documents Lengths of documents in terms of the number of tokens

    Actual data depends on the retrieval model andthe associated ranking method

  • 8/13/2019 Search Engine Architecture 1

    20/23

    J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 20

    Weighting

    Calculating index term weights using thedocument statistics and storing them in lookuptables Pre-computation can improve query answering

    efficiency

    TF/IDF weighting TF (the term frequency): the frequency of index term

    occurrences in a document

    IDF (inverse document frequency): the inverse of thefrequency of index term occurrences in all documents N/n (N: # documents indexed, n: # documentscontaining a particular term)

  • 8/13/2019 Search Engine Architecture 1

    21/23

    J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 21

    Inversion and Index Distribution

    Inversion: changing the stream of document-terminformation coming from the text transformationcomponent into term-document information for thecreation of inverted indexes Core of the indexing process The number of documents is large The indexes are updated with new documents from

    feeds and crawls, and are often compressed for high

    efficiency Index distribution: distributing indexes across

    multiple computers/sites on a network Document distribution, term distribution, and replication

  • 8/13/2019 Search Engine Architecture 1

    22/23

    J. Pei: Information Retrieval and Web Search -- Search Engine Architecture 22

    Summary

    A high-level description of search enginesoftware architecture

    The indexing process Building blocks and their functionalities

  • 8/13/2019 Search Engine Architecture 1

    23/23

    To-Do List

    Expand the figures of the indexing processto include the detailed functionalities

    Reach Chapter 2.1-2.3.3

    J Pei: Information Retrieval and Web Search -- Search Engine Architecture 23