information retrieval system-chapter-1


Upload: shahnawaz-husain

Post on 19-Dec-2015


DESCRIPTION

Introduction of IRS system

TRANSCRIPT


IRS

Information retrieval (IR) deals with the representation, storage, and organization of, and access to, information items.

Information retrieval (IR) is the process of finding relevant documents that satisfy the information needs of users from large collections of unstructured text.

General Goal of Information Retrieval

1. To help users find useful information based on their information needs (with minimum effort), despite:
   - Increasing complexity of information: whatever the shape (structured or unstructured), the size of the document corpus, or the distribution of documents.
   - Changing needs of users: a user may search for documents using different text and names.
2. To provide immediate random access to the document collection (efficient searching).

IRS Design/Architecture

Web Search System

(For general users, an IRS is delivered through the Web, e.g. search engines.)

Data retrieval is the task of a DBMS, which is owned by a specific organization.


Indexing is done at the time of storage (organization of information).

A Formal Characterization of IR Models

Boolean Model

The Boolean model is a simple retrieval model based on set theory and Boolean algebra.
It provides a framework that is easy for a common user of an IR system to grasp.
Queries are specified as Boolean expressions.
The result is binary: a document is either retrieved as relevant or it is not.
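The set-theoretic view above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the toy documents, the `build_index` and `postings` helpers, and the whitespace tokenization are all assumptions made for the example.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenization (assumption)
            index[term].add(doc_id)
    return index

def postings(index, term):
    """Return the set of documents containing `term` (empty set if unseen)."""
    return index.get(term.lower(), set())

# Hypothetical toy collection.
docs = {
    1: "information retrieval system",
    2: "database management system",
    3: "information storage and organization",
}
index = build_index(docs)
all_ids = set(docs)

# Query: information AND (NOT system)
# AND is set intersection, NOT is complement against the whole collection.
result = postings(index, "information") & (all_ids - postings(index, "system"))
print(result)
```

Each document is either in `result` or not; there is no ranking among the documents returned.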

Drawbacks of Boolean Model

Its retrieval strategy is based on a binary decision criterion (a document is predicted to be either relevant or non-relevant).
The Boolean model is in reality much more a data retrieval model than an information retrieval model.
Most users find it difficult to express their query requests as Boolean expressions.

Vector Model

The vector model proposes a framework in which partial matching is possible. This is done by assigning non-binary weights to index terms in queries and in documents.
These term weights are used to compute the degree of similarity between each document stored in the system and the user query.
Finally, the retrieved documents are sorted in decreasing order of their degree of similarity.

The degree of similarity can be calculated as:
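In the standard vector model this is the cosine of the angle between the document vector and the query vector. The notation below is an assumption made for illustration: t is the number of index terms, w_{i,j} is the weight of term k_i in document d_j, and w_{i,q} is its weight in the query q.

```latex
\mathrm{sim}(d_j, q)
  = \frac{\vec{d_j} \cdot \vec{q}}{\lvert \vec{d_j} \rvert \, \lvert \vec{q} \rvert}
  = \frac{\sum_{i=1}^{t} w_{i,j} \, w_{i,q}}
         {\sqrt{\sum_{i=1}^{t} w_{i,j}^{2}} \; \sqrt{\sum_{i=1}^{t} w_{i,q}^{2}}}
```

Because the weights are non-negative, the value lies between 0 (no shared terms) and 1 (identical direction), so a document that matches only part of the query still receives a non-zero score.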

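A rough Python sketch of vector-model ranking follows. It assumes tf-idf term weights and cosine similarity, which is one common choice rather than something the text prescribes; the function names and the toy collection are hypothetical.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build a sparse tf-idf vector (term -> weight) for every document."""
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    n_docs = len(tokenized)
    df = Counter()
    for terms in tokenized.values():
        df.update(set(terms))  # document frequency: count each term once per document
    idf = {t: math.log(n_docs / df[t]) for t in df}
    vectors = {
        d: {t: tf * idf[t] for t, tf in Counter(terms).items()}
        for d, terms in tokenized.items()
    }
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return (doc_id, score) pairs sorted by decreasing similarity to the query."""
    vectors, idf = tf_idf_vectors(docs)
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query.lower().split()).items()}
    scores = {d: cosine(q, vec) for d, vec in vectors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy collection.
docs = {
    "d1": "information retrieval system",
    "d2": "database management system",
    "d3": "information storage and retrieval",
}
ranking = rank("information retrieval", docs)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

Unlike the Boolean case, every document gets a graded score, so d3 is ranked below d1 but above d2 instead of simply being accepted or rejected.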

Advantages of Vector Model

(1) Its term-weighting scheme improves retrieval performance.

(2) Its cosine ranking formula sorts the documents according to their degree of similarity to the query.

Disadvantages of Vector Model

Theoretically, index terms are assumed to be mutually independent.

Probabilistic Model (BIR)

The probabilistic model was introduced in 1976 by Robertson and Sparck Jones and later became known as the binary independence retrieval (BIR) model.
The probabilistic model attempts to capture the IR problem within a probabilistic framework.

Fundamental Idea of Probabilistic Model

Given a user query, there is a set of documents that contains exactly the relevant documents and no others. (This is called the ideal answer set.)
Querying can be seen as specifying the properties of this ideal answer set. The problem is that we do not know exactly what these properties are.
Since these properties are not known at query time, an effort has to be made to guess initially what they could be.

The initial guess allows us to generate a preliminary probabilistic description of the ideal answer set, which is used to retrieve a first set of documents.
An interaction with the user is then initiated with the purpose of improving the probabilistic description of the ideal answer set.
On the basis of this iterated feedback, the ideal answer set is generated.

The Degree of Similarity in the Probabilistic Model
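A sketch of the ranking expression usually given for the BIR model is shown here. The notation is assumed for illustration: w_{i,j} and w_{i,q} are binary (0 or 1) weights of term k_i in document d_j and query q, R is the set of relevant documents, and \bar{R} is its complement. The initial estimates in the second line are the customary starting guesses used before any user feedback is available.

```latex
\mathrm{sim}(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \, w_{i,j}
  \left(
    \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)}
    + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})}
  \right)

P(k_i \mid R) = 0.5, \qquad P(k_i \mid \bar{R}) = \frac{n_i}{N}
```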

Where n_i is the number of documents containing keyword k_i and N is the total number of documents.

Cluster-Based Retrieval Model

Cluster-based retrieval has as its foundation the cluster hypothesis, which states that closely associated documents tend to be relevant to the same requests.
Clustering picks out closely associated documents and groups them together into one cluster.
For each cluster there is one cluster representative (C.R).
Each C.R holds a weight, which further helps the search.
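One way to picture cluster-based search is sketched below in Python. The assumptions are loud ones: the clusters are taken as already computed, the cluster representative is modeled as the centroid (average) of its members' term weights, and matching uses cosine similarity. None of these choices is stated in the text, and all names and data are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors (term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def centroid(vectors):
    """Average the term weights of a cluster's documents (assumed form of the C.R)."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}

def cluster_search(query_vec, clusters, doc_vectors):
    """Pick the cluster whose representative best matches the query,
    then rank only that cluster's documents."""
    reps = {
        name: centroid([doc_vectors[d] for d in members])
        for name, members in clusters.items()
    }
    best = max(reps, key=lambda name: cosine(query_vec, reps[name]))
    ranked = sorted(
        clusters[best],
        key=lambda d: cosine(query_vec, doc_vectors[d]),
        reverse=True,
    )
    return best, ranked

# Hypothetical document vectors and a precomputed clustering.
doc_vectors = {
    "d1": {"retrieval": 1.0, "index": 1.0},
    "d2": {"retrieval": 1.0, "query": 1.0},
    "d3": {"database": 1.0, "transaction": 1.0},
    "d4": {"database": 1.0, "schema": 1.0},
}
clusters = {"ir": ["d1", "d2"], "db": ["d3", "d4"]}

best, ranked = cluster_search({"retrieval": 1.0, "query": 1.0}, clusters, doc_vectors)
print(best, ranked)
```

The query is compared with one representative per cluster first, so documents in clusters that do not match are never scored individually.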