Information Retrieval Models

Upload: hellenndegwa

Post on 20-Nov-2015

Major Information Retrieval Models

The following major models have been developed to retrieve information: the Boolean model; the Statistical model, which includes the vector space and the probabilistic retrieval models; and the Linguistic and Knowledge-based models. The first model is often referred to as the "exact match" model; the latter ones as the "best match" models.

The Boolean Model

- Based on set theory and Boolean algebra
- Historically the most common model, used in library OPACs, the Dialog system, and many web search engines, too
- Documents are sets of terms
- Queries are Boolean expressions on terms

Boolean logic allows a user to logically relate multiple concepts together to define what information is needed. Typically the Boolean functions apply to processing tokens identified anywhere within an item. The typical Boolean operators are AND, OR, and NOT. These operations are implemented using set intersection, set union, and set difference procedures.

AND requires both terms to be in each item returned. If one term is contained in the document and the other is not, the item is not included in the resulting list. (Narrows the search.) Example: a search on stock market AND trading returns items containing: stock market trading; trading on the stock market; and trading on the late afternoon stock market.

OR: either term (or both) will be in the returned document. (Broadens the search.) Example: a search on ecology OR pollution returns documents containing the word ecology (but not pollution), other documents containing the word pollution (but not ecology), and documents containing both ecology and pollution, in any order and with any number of occurrences.
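The three set procedures named above (intersection, union, and difference) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy collection `docs` and the helper names `postings`, `boolean_and`, `boolean_or`, and `boolean_and_not` are hypothetical and do not come from any particular retrieval system.

```python
# Toy collection (illustrative data): each document is modeled as a set of terms.
docs = {
    1: {"stock", "market", "trading"},
    2: {"ecology", "pollution"},
    3: {"ecology", "market"},
    4: {"pollution", "trading"},
}


def postings(term):
    """Return the set of ids of the documents that contain the term."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}


def boolean_and(a, b):
    # Set intersection: both terms must be present (narrows the search).
    return postings(a) & postings(b)


def boolean_or(a, b):
    # Set union: either term, or both, may be present (broadens the search).
    return postings(a) | postings(b)


def boolean_and_not(a, b):
    # Set difference: documents matching a, minus those that also match b.
    return postings(a) - postings(b)


print(boolean_and("market", "trading"))
print(boolean_or("ecology", "pollution"))
print(boolean_and_not("ecology", "pollution"))
```

Because the result of each operator is itself a set, nested queries can be evaluated by applying the same set operations to intermediate results.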
NOT or AND NOT (depending on the coding of the database's search engine): the first term is searched, then any records containing the term after the operator are subtracted from the results.

Parentheses help you group and order a mixture of Boolean operators, e.g. (mouse OR rat OR mice) AND cats, or ((mouse OR rat) AND trap) OR mousetrap. With nested parentheses, the innermost parenthetical group is processed first.

Proximity

Proximity operators vary by database. There are two standard types:
- Near: finds words within a certain number of words of each other, regardless of word order.
- Within: finds words within a certain number of words of each other, in the order you specify.

JSTOR allows you to find terms that are within a specific number of words of each other using the tilde (~) as a proximity operator. For example, to search for an item with the terms debt and forgiveness within ten words of each other, you would construct the following query: "debt forgiveness"~10

Contiguous Word Phrases

A Contiguous Word Phrase (CWP) is both a way of specifying a query term and a special search operator. A CWP is two or more words that are treated as a single semantic unit. An example of a CWP is United States of America: four words that specify a search term representing a single specific semantic concept (a country).

Fuzzy Searches

Fuzzy searches provide the capability to locate spellings of words that are similar to the entered search term. This function is primarily used to compensate for errors in the spelling of words. Fuzzy searching increases recall at the expense of decreasing precision (i.e., it can erroneously identify terms as the search term).

The Statistical Models

Vector space model

In the vector space model, the assumption is made that the stored records (documents) and the information requests are represented by sets of assigned keywords or index terms. This implies that queries and documents can be modeled by term vectors of the form

$D_i = (a_{i1}, a_{i2}, \ldots, a_{it})$
$Q_j = (q_{j1}, q_{j2}, \ldots, q_{jt})$

where $t$ is the number of distinct index terms available in the system, and $a_{ik}$ and $q_{jk}$ represent the values of term $k$ in document $D_i$ or query $Q_j$, respectively. Typically, $a_{ik}$ (or $q_{jk}$) might be set equal to 1 when term $k$ appears in document $D_i$ (or in query $Q_j$), and to 0 if the term is absent from the vector. Alternatively, the vector coefficients could take on numerical values, the size of each coefficient depending on the importance of the term in the respective document or query.

The vector space model is known to be advantageous for a variety of reasons:
a) The similarity between term vectors is easily computed, based on the similarities between the term assignments to the corresponding vectors. Similarity coefficients can then be generated between queries and documents for information retrieval, or between different document vectors for document clustering purposes.
b) When the documents are arranged in decreasing order of query-document similarity, a ranking of the documents becomes available, and documents can be retrieved in that order. A document ranking feature improves the interaction between users and system during the retrieval process.
c) In the vector system, the document vectors are easily modified either by addition of new terms and removal of old terms, or by suitable alterations in the term weights.
This vector modification process is especially useful in query vector

Advantages

The vector space model has the following advantages over the Standard Boolean model:
- Simple model based on linear algebra
- Term weights not binary
- Allows computing a continuous degree of similarity between queries and documents
- Allows ranking documents according to their possible relevance
- Allows partial matching (need explanation for finding correlation)

Limitations

The vector space model has the following limitations:
- Long documents are poorly represented because they have poor similarity values (a small scalar product and a large dimensionality)
- Search keywords must precisely match document terms; word substrings might result in a "false positive match"
- Semantic sensitivity: documents with similar context but different term vocabulary won't be associated, resulting in a "false negative match"
- The order in which the terms appear in the document is lost in the vector space representation
- Assumes terms are statistically independent
- Weighting is intuitive but not very formal

Many of these difficulties can, however, be overcome by the integration of various tools, including mathematical techniques such as singular value decomposition and lexical databases such as WordNet.
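To make the ranking idea concrete, here is a minimal Python sketch of query-document similarity over binary term vectors. The text does not name a specific similarity coefficient, so cosine similarity is used here as one common choice, which is an assumption. The index terms, document vectors, and the function names `cosine` and `rank` are likewise hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two term vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical index terms (t = 4) and binary document vectors:
# position k holds 1 if term k is assigned to the document, else 0.
index_terms = ["ecology", "pollution", "market", "trading"]
documents = {
    "D1": [1, 1, 0, 0],
    "D2": [0, 0, 1, 1],
    "D3": [1, 0, 1, 0],
}
query = [1, 1, 0, 0]  # query on "ecology" and "pollution"


def rank(query, documents):
    """Return (name, score) pairs in decreasing order of query-document similarity."""
    scored = [(name, cosine(query, vec)) for name, vec in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank(query, documents)
for name, score in ranking:
    print(name, round(score, 3))
```

In this toy data D3 shares only one of the two query terms, so it still receives a non-zero score and is ranked between D1 and D2. That is the partial matching that a strict Boolean AND would not provide.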

The probabilistic models

The probabilistic models were first introduced in the early 1960s and represent an attempt to put the retrieval operations on a sound theoretical basis. The basic premise is that a document should be retrieved if its probability of relevance to the user's needs exceeds the probability of non-relevance. The probabilistic approach thus introduces the notion of relevance and non-relevance of a document, which is absent from the vector and Boolean models. This renders necessary the distinction of term characteristics in the relevant and non-relevant portions of a collection.

The main attraction of the probabilistic models is that in principle a large number of phenomena about terms and their occurrence characteristics may be taken into account, including for example term co-occurrences for any subset of terms; term relationship indications derived, for example, from existing semantic nets or other constructs used in artificial intelligence approaches; historical knowledge about how well certain terms may have done previously in retrieving relevant information in response to similar information needs; information about term meaning and term relationships derived from dictionaries and thesauruses; and any prior knowledge about the occurrence distribution of terms in certain parts of the collection. Because the probabilistic model can accommodate all this intelligence about documents and queries, it offers the promise of vastly greater effectiveness than the basic vector and Boolean models.

Weaknesses

a) Because the probabilistic system is not based on existing initial query formulations, the opportunity of independent weighting of query and document terms that exists in the vector system is lost in the probabilistic environment.
b) In the normal relevance feedback approach, the initial query terms are considered to be crucially important. Since initial query terms are not available in the probabilistic system, a probabilistic relevance feedback operation may produce inferior results.
c) The probabilistic approach can incorporate unspecified term dependencies; no distinction is made, however, between different types of dependencies of the kind implicitly specified in the Boolean model (where term synonyms are expressed by or-operators, and term phrases by and-operators). In practice, a completely parallel treatment of very different classes of term dependencies may not produce useful retrieval results.
d) Some objective measurements that are routinely used in a vector system, such as the number of terms attached to a document, or the sum of the weights of the document terms, are excluded from the existing probabilistic approaches.
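The basic premise of the probabilistic models (retrieve a document when its probability of relevance exceeds its probability of non-relevance) can be illustrated with a small Python sketch. It makes a simplifying assumption the text does not require, namely that terms occur independently of one another, and it uses made-up term probabilities for the relevant and non-relevant portions of a collection. The names `term_stats`, `log_odds`, and `should_retrieve` are hypothetical.

```python
import math

# Hypothetical estimates per term: (p, u), where
#   p = probability the term occurs in a relevant document
#   u = probability the term occurs in a non-relevant document
term_stats = {
    "ecology": (0.8, 0.2),
    "pollution": (0.6, 0.3),
    "market": (0.1, 0.4),
}


def log_odds(doc_terms, term_stats, prior_relevant=0.5):
    """Log of P(relevant | doc) / P(non-relevant | doc), assuming independent terms."""
    score = math.log(prior_relevant / (1 - prior_relevant))
    for term, (p, u) in term_stats.items():
        if term in doc_terms:
            score += math.log(p / u)              # evidence from a present term
        else:
            score += math.log((1 - p) / (1 - u))  # evidence from an absent term
    return score


def should_retrieve(doc_terms, term_stats, prior_relevant=0.5):
    # Retrieve when relevance is more probable than non-relevance (log-odds > 0).
    return log_odds(doc_terms, term_stats, prior_relevant) > 0


print(should_retrieve({"ecology", "pollution"}, term_stats))
print(should_retrieve({"market"}, term_stats))
```

The independence assumption is exactly the kind of simplification that richer probabilistic models aim to relax by accounting for term co-occurrences and other term relationships.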