Chapter 2: Modeling. CSIE (資工) 4B, 86075800, 陳建勳. Introduction. Traditional information retrieval...


Post on 20-Dec-2015


Chapter 2 Modeling

Computer Science and Information Engineering (資工) 4B, 86075800, 陳建勳

Introduction.

Traditional information retrieval systems usually adopt index terms to index and retrieve documents. An index term is a keyword (or group of related words) which has some meaning of its own (usually a noun).

The advantage of using index terms

Simple.

The semantics of the documents and of the user information need can be naturally expressed through sets of index terms.

Ranking algorithms are at the core of information retrieval systems (predicting which documents are relevant and which are not).

A taxonomy of information retrieval models

User task
  Retrieval: Ad hoc, Filtering
    Classic Models: Boolean, Vector, Probabilistic
      Set Theoretic: Fuzzy, Extended Boolean
      Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
      Probabilistic: Inference Network, Belief Network
    Structured Models: Non-overlapping Lists, Proximal Nodes
  Browsing
    Browsing: Flat, Structure Guided, Hypertext

User task    Index Terms                 Full Text                   Full Text + Structure
Retrieval    Classic, Set Theoretic,     Classic, Set Theoretic,     Structured
             Algebraic, Probabilistic    Algebraic, Probabilistic
Browsing     Flat                        Flat, Hypertext             Structure Guided, Hypertext

Figure 2.2 Retrieval models most frequently associated with distinct combinations of a document logical view and a user task.

Retrieval : Ad hoc and Filtering

Ad hoc : The documents in the collection remain relatively static while new queries are submitted to the system.

Filtering : The queries remain relatively static while new documents come into the system.

Filtering

Typically, the filtering task simply indicates to the user the documents which might be of interest to him.

Routing : Rank the filtered documents and show this ranking to the user.

Constructing user profiles in two ways.
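
The difference between filtering and routing can be sketched in a few lines. This is only an illustration under stated assumptions: the user profile is taken to be a plain set of keywords, and `score`, `filter_docs`, `route_docs`, and the threshold of 0.5 are hypothetical names and values chosen here, not part of the slides.

```python
def score(profile, doc_terms):
    """Fraction of profile keywords that occur in the document (assumed measure)."""
    if not profile:
        return 0.0
    return len(profile & doc_terms) / len(profile)


def filter_docs(profile, incoming, threshold=0.5):
    """Filtering: report the documents of possible interest, in arrival order."""
    return [doc_id for doc_id, terms in incoming
            if score(profile, terms) >= threshold]


def route_docs(profile, incoming, threshold=0.5):
    """Routing: the same selection, but ranked by decreasing score."""
    selected = [(score(profile, terms), doc_id) for doc_id, terms in incoming
                if score(profile, terms) >= threshold]
    selected.sort(key=lambda pair: -pair[0])  # stable sort keeps arrival order on ties
    return [doc_id for _, doc_id in selected]


# The profile stays fixed while new documents arrive.
profile = {"retrieval", "ranking"}
incoming = [
    ("d1", {"retrieval", "index"}),
    ("d2", {"cooking", "recipes"}),
    ("d3", {"retrieval", "ranking", "vector"}),
]

filtered = filter_docs(profile, incoming)
routed = route_docs(profile, incoming)
```

Here `filtered` keeps the arrival order of the selected documents, while `routed` places the best-matching document first.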

A formal characterization of IR models

D : A set composed of logical views (or representations) for the documents in the collection.

Q : A set composed of logical views (or representations) for the user information needs (queries).

F : A framework for modeling document representations, queries, and their relationships.

R(qi, dj) : A ranking function which defines an ordering among the documents with regard to the query.

Classic information retrieval model

Basic concepts : Each document is described by a set of representative keywords called index terms.

Assign numerical weights to index terms, because distinct index terms have varying relevance when describing a document's contents.

Define

ki : A generic index term

K : The set of all index terms {k1,…,kt}

wi,j : A weight associated with index term ki of a document dj

gi : A function that returns the weight associated with ki in any t-dimensional vector ( gi(dj) = wi,j )

Boolean model

Based on a binary decision criterion without any notion of a grading scale.

Boolean expressions have precise semantics.

It is not simple to translate an information need into a Boolean expression.

A query can be represented as a disjunction of conjunctive vectors (in disjunctive normal form, DNF).
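
A minimal sketch of Boolean matching against a query in DNF follows. The representation is an assumption made for illustration: each conjunctive component is a pair of sets (terms that must be present, terms that must be absent), and the query q = ka AND (kb OR NOT kc) is written in the equivalent form (ka AND kb) OR (ka AND NOT kc). The function name `matches` is hypothetical.

```python
def matches(query_dnf, doc_terms):
    """Return True if the document satisfies at least one conjunctive component.

    The result is binary: a document either matches or it does not,
    with no notion of partial match or ranking.
    """
    for required, forbidden in query_dnf:
        if required <= doc_terms and not (forbidden & doc_terms):
            return True
    return False


# q = ka AND (kb OR NOT kc), rewritten as (ka AND kb) OR (ka AND NOT kc)
query = [
    ({"ka", "kb"}, set()),
    ({"ka"}, {"kc"}),
]

docs = {
    "d1": {"ka", "kb", "kc"},
    "d2": {"ka", "kc"},
    "d3": {"ka"},
    "d4": {"kb"},
}

answer = sorted(doc_id for doc_id, terms in docs.items() if matches(query, terms))
```

Only d1 and d3 satisfy the query. Note that d2 is rejected outright even though it contains ka, which illustrates the lack of a grading scale.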

Vector model

Assign non-binary weights to index terms in queries and in documents.

Compute the similarity between documents and query.

More precise than Boolean model.
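
The slide does not name a similarity measure. A common choice for the vector model is the cosine of the angle between the document vector and the query vector, sketched below under that assumption. Vectors are stored as dictionaries from term to weight, and `cosine_similarity` is a hypothetical helper name.

```python
import math


def cosine_similarity(doc_weights, query_weights):
    """Cosine of the angle between two sparse weight vectors (term -> weight)."""
    dot = sum(w * query_weights.get(term, 0.0) for term, w in doc_weights.items())
    norm_d = math.sqrt(sum(w * w for w in doc_weights.values()))
    norm_q = math.sqrt(sum(w * w for w in query_weights.values()))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0  # an empty vector is treated as matching nothing
    return dot / (norm_d * norm_q)


query = {"information": 1.0, "retrieval": 1.0}
docs = {
    "d1": {"information": 0.5, "retrieval": 0.8},
    "d2": {"information": 0.9, "cooking": 0.7},
    "d3": {"cooking": 1.0},
}

# Rank documents by decreasing similarity to the query.
ranking = sorted(docs, key=lambda d: cosine_similarity(docs[d], query), reverse=True)
```

Unlike the Boolean model, d2 is not discarded for matching only one query term; it is simply ranked below d1.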

Idea: We think of the documents as a collection C of objects and think of the user query as a specification of a set A of objects. In this scenario, the IR problem can be reduced to the problem of determining which documents are in the set A and which ones are not (i.e., the IR problem can be viewed as a clustering problem).

Intra-cluster : One needs to determine which features better describe the objects in the set A.

Inter-cluster : One needs to determine which features better distinguish the objects in the set A from the remaining objects in the collection C.

tf : Intra-cluster similarity is quantified by measuring the raw frequency of a term ki inside a document dj. This term frequency is usually referred to as the tf factor and provides one measure of how well the term describes the document contents.

idf : Inter-cluster dissimilarity is quantified by measuring the inverse of the frequency of a term ki among the documents in the collection. This factor is usually referred to as the inverse document frequency, or idf factor.
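
As a rough sketch of how the tf and idf factors combine into a term weight: the version below assumes tf is the raw frequency normalized by the most frequent term in the document, idf is log(N / ni) with N documents and ni documents containing ki, and the weight is their product. These are common choices rather than something the slides fix, and `tf_idf_weights` is a hypothetical name.

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """Compute w[i,j] = tf[i,j] * idf[i] for every term of every document.

    docs maps a document id to its list of index terms.
    """
    num_docs = len(docs)
    doc_freq = Counter()  # ni: number of documents containing each term
    for terms in docs.values():
        doc_freq.update(set(terms))

    weights = {}
    for doc_id, terms in docs.items():
        counts = Counter(terms)
        max_count = max(counts.values()) if counts else 1
        weights[doc_id] = {
            term: (count / max_count) * math.log(num_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights


docs = {
    "d1": ["information", "retrieval", "retrieval"],
    "d2": ["information", "model"],
}
weights = tf_idf_weights(docs)
```

In this toy collection "information" occurs in every document, so its idf (and weight) is zero: it does not help distinguish one document from another. "retrieval" is frequent inside d1 and rare across the collection, so it receives the largest weight there.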

Vector model is simple and fast. It’s a popular retrieval model.

Disadvantage : Index terms are assumed to be mutually independent. It doesn’t account for index term dependencies.

Probabilistic model

We can think of the querying process as a process of specifying the properties of an ideal answer set. The problem is that we do not know exactly what these properties are.
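
Because the properties of the ideal answer set are unknown, they have to be guessed initially. The sketch below shows one well-known way of doing so, the binary independence formulation, and is an assumption beyond what the slide states: every query term is given probability 0.5 of appearing in a relevant document, and its probability of appearing in a non-relevant document is estimated from its document frequency (with a small smoothing adjustment to avoid log of zero). The name `bim_score` and the sample numbers are hypothetical.

```python
import math


def bim_score(query_terms, doc_terms, doc_freq, num_docs):
    """Initial-guess score for one document under the stated assumptions.

    Only terms present in both the query and the document contribute.
    """
    total = 0.0
    for term in query_terms & doc_terms:
        p_rel = 0.5  # assumed P(term | relevant)
        p_nonrel = (doc_freq[term] + 0.5) / (num_docs + 1)  # smoothed P(term | non-relevant)
        total += math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)
    return total


num_docs = 5
doc_freq = {"retrieval": 1, "model": 2}
query = {"retrieval", "model"}

score_d1 = bim_score(query, {"retrieval", "model"}, doc_freq, num_docs)
score_d2 = bim_score(query, {"model"}, doc_freq, num_docs)
score_d3 = bim_score(query, {"cooking"}, doc_freq, num_docs)
```

With these estimates a document is scored higher the more rare query terms it contains. In a full system the estimates would then be refined from the top-ranked documents or from user feedback.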

Structured text retrieval model

Retrieval models which combine information on text content with information on the document structure are called structured text retrieval models.

Match point : refers to the position in the text of a sequence of words which matches the user query.

Region : refers to a contiguous portion of the text.

Node : refers to a structural component of the document such as a chapter, a section, or a subsection.

Model based on Non-overlapping lists

Divide the whole text of each document into non-overlapping text regions which are collected in a list.

Text regions in the same list do not overlap, but text regions from distinct lists might overlap.
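
The idea can be illustrated with regions stored as (start, end) character positions. This is a sketch under assumptions made here: positions are inclusive integers, each list is a plain Python list, and `is_non_overlapping` and `region_containing` are hypothetical helper names.

```python
def is_non_overlapping(regions):
    """Check that no two regions of the same list share a text position."""
    ordered = sorted(regions)
    return all(prev_end < next_start
               for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]))


def region_containing(regions, position):
    """Return the region of this list that contains a match point, if any."""
    for start, end in regions:
        if start <= position <= end:
            return (start, end)
    return None


# Two distinct lists over the same text: each is non-overlapping on its own,
# but a chapter region overlaps the section regions inside it.
chapters = [(0, 499), (500, 999)]
sections = [(0, 199), (200, 499), (500, 999)]

match_point = 250
chapter_hit = region_containing(chapters, match_point)
section_hit = region_containing(sections, match_point)
```

The same match point falls in one region per list, so a query can ask for the chapter or the section containing it without the two lists interfering.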

Model based on Proximal nodes

A model which allows the definition of independent hierarchical indexing structures over the same document text.

Each of these index structures is a strict hierarchy composed of chapters, sections, paragraphs, pages, and lines, which are called nodes.
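
One such hierarchy can be sketched as a tree of nodes, each covering a span of the text. The `Node` class, the `path_to` function, and the character positions below are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str            # e.g. "chapter", "section", "paragraph"
    start: int           # first text position covered (inclusive)
    end: int             # last text position covered (inclusive)
    children: list = field(default_factory=list)


def path_to(node, position):
    """Return the node kinds from this node down to the innermost node
    that contains the given text position, or [] if it is outside."""
    if not (node.start <= position <= node.end):
        return []
    for child in node.children:
        sub_path = path_to(child, position)
        if sub_path:
            return [node.kind] + sub_path
    return [node.kind]


chapter = Node("chapter", 0, 999, [
    Node("section", 0, 499, [
        Node("paragraph", 0, 99),
        Node("paragraph", 100, 499),
    ]),
    Node("section", 500, 999),
])

inner_path = path_to(chapter, 150)
outer_path = path_to(chapter, 700)
```

A second, independent hierarchy (for example pages and lines) could be built over the same text positions with its own tree of `Node` objects.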

Models for browsing

Flat browsing

Structure guided browsing

The hypertext model

Flat browsing

The documents might be represented as dots in a plane or as elements in a list.

Relevance feedback

Disadvantage : In a given page or screen there may not be any indication about the context where the user is.

Structure guided browsing

Organized in a directory structure. It groups documents covering related topics.

The same idea can be applied to a single document.

Using a history map.

The hypertext model

Written text is usually conceived to be read sequentially.

The reader should not expect to fully understand the message conveyed by the writer by randomly reading pieces of text here and there.