information retrieval models - 1 boolean. introduction ir systems usually adopt index terms to...

Information Retrieval Models - 1 Boolean

Upload: randolph-sims

Post on 01-Jan-2016


Page 1: Information Retrieval Models - 1 Boolean. Introduction IR systems usually adopt index terms to process queries Index terms:  A keyword or group of selected

Information Retrieval Models - 1 Boolean


Introduction

• IR systems usually adopt index terms to process queries

• Index terms: a keyword or group of selected words; more generally, any word

• Stemming might be used: e.g., the stem "connect" covers connecting, connection, connections, connected

• An inverted file is built for the chosen index terms
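To make the last two bullets concrete, here is a minimal sketch of building an inverted file over stemmed index terms. The suffix list and the names `crude_stem` and `build_inverted_file` are illustrative assumptions, not part of the slides; a real system would use an established stemming algorithm rather than this toy suffix stripper.

```python
from collections import defaultdict

# Hypothetical suffix list for a deliberately crude stemmer (an assumption
# for illustration only; it is not a real stemming algorithm).
SUFFIXES = ("ions", "ion", "ing", "ed", "s")


def crude_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_inverted_file(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each stemmed index term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[crude_stem(token)].add(doc_id)
    return dict(index)


# Toy collection: three documents that all reduce to the stem "connect".
docs = {
    1: "connecting two networks",
    2: "a lost connection",
    3: "connected components",
}
inverted = build_inverted_file(docs)
print(sorted(inverted["connect"]))  # [1, 2, 3]
```

Because all three variants reduce to the same stem, a query for "connect" reaches every document that uses any of them.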


Introduction

• Matching a query to documents based on index terms is imprecise … so it’s no surprise users can get unsatisfactory results.

• How much training do end-users typically have? Usually very little; as a result, they are frustrated with web results, too

• IR systems need not only to locate documents but also to rank them, based on the concept of relevancy.


Introduction

• A ranking is an ordering of the retrieved documents that reflects the relevance of the documents to the user (as expressed through the query)

• Ranking is based on fundamental premises regarding the notion of relevancy, such as: common sets of index terms; sharing of weighted terms; likelihood of relevance

• Each set of premises leads to distinct IR models


Boolean Retrieval

• Index terms are either present or absent: no middle ground

• The weights are either 0 (not present) or 1 (present); in set notation, wi,j ∈ {0,1}

• In IR, relevancy is considered as a degree of similarity between a document (or set of documents) and the query's term (or terms), written sim(dj, q): the similarity of document dj to query q
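The binary weights and the similarity function can be sketched as follows. This is a minimal illustration that assumes a purely conjunctive (AND) query; the function names and the three-term vocabulary are made up for the example. Under the Boolean model the similarity itself is also binary: a document either satisfies the query or it does not.

```python
def binary_weights(index_terms: list[str], doc_terms: set[str]) -> list[int]:
    """Return w where w[i] is 1 if index term k_i occurs in the document, else 0."""
    return [1 if term in doc_terms else 0 for term in index_terms]


def sim_conjunctive(doc_weights: list[int], query_weights: list[int]) -> int:
    """Boolean similarity for an AND query: 1 if every query term is present."""
    return int(
        all(d == 1 for d, q in zip(doc_weights, query_weights) if q == 1)
    )


# Hypothetical vocabulary and documents, for illustration only.
index_terms = ["boolean", "model", "ranking"]
d1 = binary_weights(index_terms, {"boolean", "model"})
d2 = binary_weights(index_terms, {"ranking"})
q = binary_weights(index_terms, {"boolean", "model"})

print(d1, sim_conjunctive(d1, q))  # [1, 1, 0] 1
print(d2, sim_conjunctive(d2, q))  # [0, 0, 1] 0
```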


Boolean Sets

Demo on board
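The board demo itself is not in the transcript. As a stand-in, here is a small sketch of how Boolean operators map onto set operations over the documents that contain each term. The document ids and postings are invented for the example.

```python
# Hypothetical collection of four documents and the sets of documents
# containing each term (the postings of an inverted file).
all_docs = {1, 2, 3, 4}
postings = {
    "retrieval": {1, 2, 4},
    "boolean": {2, 3},
    "vector": {3, 4},
}

# AND is set intersection.
and_result = postings["retrieval"] & postings["boolean"]

# OR is set union.
or_result = postings["boolean"] | postings["vector"]

# NOT is the complement with respect to the whole collection.
not_result = all_docs - postings["vector"]

# Operators compose: retrieval AND NOT vector.
combined = postings["retrieval"] & (all_docs - postings["vector"])

print(sorted(and_result))  # [2]
print(sorted(or_result))   # [2, 3, 4]
print(sorted(not_result))  # [1, 2]
print(sorted(combined))    # [1, 2]
```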


Boolean Retrieval

• The Boolean model is better suited for data retrieval; compare the SQL query "SELECT * FROM libraryDB WHERE author = 'Smith'"

• Question: What if there are a lot of matches? How do we distinguish between matches, e.g., for (author = 'Smith' AND title = 'Learning Swedish')?

• Can we use the binary model and modify it for ranking?

• Alternatives? [You bet!]
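One simple way to see the difference between exact matching and ranking is to count how many query conditions each record satisfies. The sketch below is only an illustration of that idea, with invented records and helper names; it is not necessarily one of the alternatives the course goes on to present.

```python
# Hypothetical library records, for illustration only.
records = [
    {"id": 1, "author": "Smith", "title": "Learning Swedish"},
    {"id": 2, "author": "Smith", "title": "Learning Danish"},
    {"id": 3, "author": "Jones", "title": "Learning Swedish"},
    {"id": 4, "author": "Jones", "title": "Cooking"},
]
query = {"author": "Smith", "title": "Learning Swedish"}


def exact_match(record: dict, query: dict) -> bool:
    """Data-retrieval style: every condition must hold."""
    return all(record.get(field) == value for field, value in query.items())


def match_count(record: dict, query: dict) -> int:
    """Number of query conditions the record satisfies."""
    return sum(record.get(field) == value for field, value in query.items())


# Strict Boolean AND: a record is either in the answer set or not.
exact_ids = [r["id"] for r in records if exact_match(r, query)]

# Relaxed variant: keep partial matches and order them by match count
# (ties broken by id so the order is deterministic).
partial = [r for r in records if match_count(r, query) > 0]
ranked = sorted(partial, key=lambda r: (-match_count(r, query), r["id"]))
ranked_ids = [r["id"] for r in ranked]

print(exact_ids)   # [1]
print(ranked_ids)  # [1, 2, 3]
```

The strict query returns only record 1, while the relaxed version also surfaces the near misses, ordered behind the full match.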


IR Models

[Figure: taxonomy of IR models, organized by user task]

• User task: Retrieval (ad hoc, filtering)

• Classic models: Boolean, vector, probabilistic

• Set theoretic: fuzzy, extended Boolean

• Algebraic: generalized vector, latent semantic indexing, neural networks

• Probabilistic: inference network, belief network

• Structured models: non-overlapping lists, proximal nodes

• User task: Browsing

• Browsing models: flat, structure guided, hypertext


IR Models

The IR model, the logical view of the docs, and the retrieval task are distinct aspects of the system

User task (rows) vs. logical view of documents (columns):

User task | Index Terms | Full Text | Full Text + Structure
Retrieval | Classic, Set Theoretic, Algebraic, Probabilistic | Classic, Set Theoretic, Algebraic, Probabilistic | Structured
Browsing | Flat | Flat, Hypertext | Structure Guided, Hypertext


Basic Concepts: Classic IR Models

• Inherent properties of documents: words, aka keywords*, aka index terms

• Represent the document through “sets of keywords” (or index terms; the main themes)

• Use nouns because nouns are believed to carry the most (semantic) meaning

• Search engines, however, assume that all words are index terms (“full text representation”)


Classic IR Models - Basic Concepts

• Not all terms are equally useful for representing the document contents: less frequent terms allow identifying a narrower set of docs.

• The importance of the index terms is represented by weights: recall that the Boolean model uses weights in {0,1}; all other models use a value between 0 and 1, which allows degrees of similarity
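The point about less frequent terms can be checked on a toy collection: the fewer documents a term occurs in, the narrower the set it selects. The collection and the helper `matching_docs` below are invented for this sketch.

```python
# Hypothetical collection: each document is represented by its set of terms.
docs = {
    1: {"information", "retrieval", "boolean"},
    2: {"information", "retrieval", "vector"},
    3: {"information", "fuzzy"},
    4: {"information", "retrieval"},
}


def matching_docs(term: str) -> set[int]:
    """Return the ids of the documents that contain the term."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}


# Document frequency per term: how many documents the term selects.
doc_freq = {
    term: len(matching_docs(term))
    for term in ("information", "retrieval", "fuzzy")
}

print(doc_freq)  # {'information': 4, 'retrieval': 3, 'fuzzy': 1}
```

Here "information" occurs everywhere and so distinguishes nothing, while "fuzzy" pins down a single document.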


Classic IR Models - Basic Concepts

• Let ki be an index term, dj a document, and wi,j a weight associated with the pair (ki, dj)

• The weight wi,j quantifies the importance of the index term ki for describing the contents of document dj.
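As a rough sketch of this notation, the weights can be stored as a matrix indexed by term i and document j. The binary weights follow the Boolean model; the normalized term frequency is only one illustrative way to get a value between 0 and 1 and is an assumption of this example, not a scheme defined in these slides.

```python
from collections import Counter

# Hypothetical index terms k_i and documents d_j, for illustration only.
index_terms = ["boolean", "model", "query"]
documents = ["boolean model boolean query", "model model model"]


def binary_weight(term: str, tokens: list[str]) -> int:
    """Boolean-model weight: 1 if the term occurs in the document, else 0."""
    return 1 if term in tokens else 0


def tf_weight(term: str, tokens: list[str]) -> float:
    """Term frequency divided by the count of the most frequent term.

    Assumes a non-empty document; the result lies between 0 and 1.
    """
    counts = Counter(tokens)
    return counts[term] / max(counts.values())


tokenized = [doc.split() for doc in documents]

# w[i][j] is the weight of index term k_i in document d_j.
w_binary = [[binary_weight(k, d) for d in tokenized] for k in index_terms]
w_tf = [[tf_weight(k, d) for d in tokenized] for k in index_terms]

print(w_binary)  # [[1, 0], [1, 1], [1, 0]]
print(w_tf)      # [[1.0, 0.0], [0.5, 1.0], [0.5, 0.0]]
```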