Information Retrieval
Models - 1: Boolean
Introduction
• IR systems usually adopt index terms to process queries
• Index terms:
  - A keyword or group of selected words
  - Any word (more general)
• Stemming might be used so that word variants share one index term, e.g. connect: connecting, connection, connections, connected
• An inverted file is built for the chosen index terms
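The stemming and inverted-file steps above can be sketched roughly as follows. This is a minimal illustration, not the method of any particular system: the suffix list, the stem function, and the sample documents are all hypothetical, and real systems use a proper stemming algorithm such as Porter's.

```python
from collections import defaultdict

# Hypothetical toy suffix list, for illustration only.
SUFFIXES = ("ions", "ion", "ing", "ed", "s")

def stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_inverted_file(docs):
    """Map each stemmed index term to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[stem(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Made-up sample documents.
docs = {
    1: "connecting networks",
    2: "connection lost",
    3: "connected graphs",
}
index = build_inverted_file(docs)
print(index["connect"])  # all three documents share the stem "connect"
```

Because the three variants reduce to the same stem, a query on any one of them reaches all three documents through a single index entry.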
Introduction
• Matching a query to documents based on index terms is imprecise … so it’s no surprise users can get unsatisfactory results.
• How much training do end-users typically have? Usually little, so they are frustrated with web results, too
• We need not only to locate documents but also to rank them, based on the concept of relevancy.
Introduction
• A ranking is an ordering of the retrieved documents that reflects the relevance of the documents to the user (through the query)
• Ranking is based on fundamental premises regarding the notion of relevancy, such as:
  - Common sets of index terms
  - Sharing of weighted terms
  - Likelihood of relevance
• Each set of premises leads to distinct IR models
Boolean Retrieval
• Index terms are either present or absent: no middle ground
• The weights are either 0 (not present) or 1 (present); in set notation, wi,j ∈ {0,1}
• In IR, relevancy is considered as a degree of similarity between a document (or set of documents) and the query’s term (or terms): Sim(dj, q) is the similarity of document #j to query q
Boolean Sets
Demo on board
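A minimal sketch of how Boolean queries reduce to set operations, assuming a tiny made-up collection. The document ids, the terms, and the helpers postings and sim are hypothetical names chosen for illustration.

```python
# Each document is represented only by the set of index terms it contains.
docs = {
    "d1": {"information", "retrieval", "boolean"},
    "d2": {"information", "database"},
    "d3": {"boolean", "algebra"},
}

def postings(term):
    """Return the set of documents whose weight for `term` is 1."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}

# Boolean operators map directly onto set operations.
and_result = postings("information") & postings("boolean")  # information AND boolean
or_result = postings("information") | postings("boolean")   # information OR boolean
not_result = postings("information") - postings("boolean")  # information AND NOT boolean

def sim(doc_id, matching):
    """Binary similarity: 1 if the document satisfies the query, else 0."""
    return 1 if doc_id in matching else 0

print(sorted(and_result), sorted(or_result), sorted(not_result))
```

Every document either satisfies the query or does not, so the result is an unordered set with no notion of a better or worse match.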
Boolean Retrieval
• The Boolean model is better suited for data retrieval; compare the SQL query "SELECT * FROM libraryDB WHERE author = 'Smith'"
• Question: What about a lot of matches? We need a way to distinguish between matches (e.g., author = "Smith" and title = "Learning Swedish")
Can we use the binary model and modify it for ranking?
• Alternatives? [You bet!]
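One simple possibility, sketched here as an assumption rather than as the alternative the slides go on to present, is to keep binary term weights but score each document by the fraction of query terms it contains. The function name rank_by_overlap and the sample data are hypothetical.

```python
def rank_by_overlap(query_terms, docs):
    """Rank documents by the fraction of query terms they contain (binary weights)."""
    query = set(query_terms)
    scored = []
    for doc_id, terms in docs.items():
        overlap = len(query & terms)
        if overlap > 0:  # drop documents that match no query term
            scored.append((doc_id, overlap / len(query)))
    # Highest score first; ties broken by document id for a stable order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored

# Made-up collection: each document is the set of its index terms.
docs = {
    "d1": {"smith", "learning", "swedish"},
    "d2": {"smith", "cooking"},
    "d3": {"jones", "gardening"},
}
ranking = rank_by_overlap(["smith", "swedish"], docs)
print(ranking)
```

A document matching both query terms now ranks above one matching only a single term, which a strict Boolean AND or OR cannot express.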
IR Models
[Figure: taxonomy of IR models, organized by user task]
• User task: Retrieval (ad hoc, filtering)
  - Classic models: Boolean, vector, probabilistic
  - Structured models: non-overlapping lists, proximal nodes
• User task: Browsing
  - Flat, structure guided, hypertext
• Set theoretic: fuzzy, extended Boolean
• Algebraic: generalized vector, latent semantic indexing, neural networks
• Probabilistic: inference network, belief network
IR Models
• The IR model, the logical view of the docs, and the retrieval task are distinct aspects of the system
[Table: IR models by USER TASK and LOGICAL VIEW OF DOCUMENTS]
• Retrieval
  - Index terms: classic, set theoretic, algebraic, probabilistic
  - Full text: classic, set theoretic, algebraic, probabilistic
  - Full text + structure: structured
• Browsing
  - Index terms: flat
  - Full text: flat, hypertext
  - Full text + structure: structure guided, hypertext
Basic Concepts: Classic IR Models
• Inherent properties of documents: words, aka keywords*, aka index terms
• Represent the document through “sets of keywords” (or index terms; the main themes)
• Use nouns because nouns are believed to carry the most (semantic) meaning
• Search engines, however, assume that all words are index terms (“full text representation”)
Classic IR Models - Basic Concepts
• Not all terms are equally useful for representing the document contents: less frequent terms allow identifying a narrower set of docs.
• The importance of the index terms is represented by weights
  - Recall that the Boolean model uses weights in {0,1}
  - All other models use a value between 0 and 1
Degrees of similarity
Classic IR Models - Basic Concepts
• Let ki be an index term and dj be a document; wi,j is the weight associated with the pair (ki, dj)
• The weight wi,j quantifies the importance of the index term ki for describing the contents of document dj.
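To make the notation concrete, the sketch below computes wi,j for one document in two ways: the binary weights of the Boolean model, and a graded weight between 0 and 1. The graded scheme shown (term count divided by the count of the most frequent term) is only an assumed example; the slides do not specify a weighting formula here, and the function names are hypothetical.

```python
from collections import Counter

def binary_weights(index_terms, doc_tokens):
    """Boolean model: w[k] is 1 if term k occurs in the document, else 0."""
    present = set(doc_tokens)
    return {k: 1 if k in present else 0 for k in index_terms}

def normalized_tf_weights(index_terms, doc_tokens):
    """Assumed graded scheme: term count divided by the largest term count."""
    counts = Counter(doc_tokens)
    max_count = max(counts.values()) if counts else 0
    return {k: (counts[k] / max_count if max_count else 0.0) for k in index_terms}

index_terms = ["boolean", "model", "ranking"]      # the terms ki
d_j = "boolean model boolean retrieval".split()    # tokens of one document dj

w_binary = binary_weights(index_terms, d_j)
w_graded = normalized_tf_weights(index_terms, d_j)
print(w_binary)
print(w_graded)
```

With binary weights, "boolean" and "model" are indistinguishable; the graded weights record that "boolean" is the more prominent term in this document.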