Chapter 4 Matching Process
Matching Process
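The fact that one word of the query appears in a document does not mean the document must be retrieved.
  The query may contain several words.
  The word may be unimportant in context, e.g. “This document is not about...”
This chapter assumes that a document matches a query only with uncertainty.
  The focus is on how strong the relevance is.
  The focus is on the topicality of the document: the degree to which the topic of the document agrees with the topic of the query.
  The user's knowledge, background, and preferences are covered in Chapter 6.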
4.1 Relevance and Similarity Measure
Document space: an organized set of documents
The document space does not contain queries
  a mapping from the document space into the query space (Boolean systems)
  a characteristic function whose value on a document reflects its relevance to the query: [0, 1]
The document space contains queries
  the query is a point in the document space
  relevant documents: a cluster near the query point
  evaluation function: defines a contour
Measure
  the basis for evaluating each document
  some computable function
  it measures whether a document is relevant to a query
  As relevance is ultimately in the mind of the user, it is difficult to measure directly.
  IR systems rely primarily on measurements from document and query representations.
  Most systems equate relevance with lexical similarity, i.e. the matching of words.
4.2 Boolean-Based Matching
Matching tests whether a document contains a given term.
The query is a logical function of the given words; the document is not.
  There is no structural similarity between the two: matching is a characteristic function.
There is no basis for developing significant similarity judgments: a document either satisfies the query or it does not.
  A modification: assign grades to the results of ‘A OR B OR C’.
Since Boolean systems operate on the basis of the presence or absence of terms, many such systems do not include term frequency data.
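The difference between a plain Boolean match and a graded one can be sketched as follows. This is only an illustration: the slides do not say how the grade is assigned, so grading by the number of distinct query terms present is an assumption, and the documents d1 to d3 are made up.

```python
def boolean_or_match(doc_terms, query_terms):
    """Characteristic function for 'A OR B OR C': True if any query term occurs."""
    return any(t in doc_terms for t in query_terms)


def graded_or_match(doc_terms, query_terms):
    """Assumed grading rule: count the distinct query terms the document contains."""
    return sum(1 for t in set(query_terms) if t in doc_terms)


# Hypothetical documents, each reduced to the set of terms it contains.
docs = {
    "d1": {"A", "B", "C"},
    "d2": {"A", "X"},
    "d3": {"Y", "Z"},
}
query = ["A", "B", "C"]

# Plain Boolean retrieval: an unordered set of satisfying documents.
matched = {name for name, terms in docs.items() if boolean_or_match(terms, query)}

# Graded retrieval: the same set, ordered by the assumed grade.
ranked = sorted(matched, key=lambda name: graded_or_match(docs[name], query), reverse=True)
```

Only presence or absence of each term is used here, which is why no term frequency data is needed.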
4.3 Vector-Based Matching: Metrics
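Metrics: distance measures and angular measures
  Distance measure: assumes that documents close to each other in the vector space are similar.
  Angular measure: assumes that documents lying in similar directions in the vector space are similar.
The distance of a document from itself is 0.
  A metric is therefore not a similarity measure but a dissimilarity measure, so a conversion is needed.
A linear conversion from a metric to a similarity measure is generally not desirable.
  If a metric $\delta$ is converted to a similarity $\sigma$ by $\sigma = k - \delta$, it is difficult to choose a suitable value of $k$.
An inversion transform maps the distance into a fixed positive range of numbers:
  $\sigma = b^{-P(\delta)}$, where $b > 1$ and $P(\cdot)$ is monotonically increasing.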
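A minimal sketch of the two conversions, assuming the inversion transform has the form b ** (-P(distance)) with b > 1 and P monotonically increasing. The defaults (b = e, P the identity) and the sample distances are illustrative choices, not values from the slides.

```python
import math


def inversion_similarity(delta, b=math.e, p=lambda d: d):
    """Map a distance delta >= 0 to a similarity via b ** (-p(delta)).

    Assumes b > 1 and p monotonically increasing with p(0) = 0,
    so a distance of 0 maps to similarity 1 and larger distances approach 0.
    """
    if b <= 1:
        raise ValueError("b must be greater than 1")
    return b ** (-p(delta))


def linear_similarity(delta, k):
    """Linear conversion k - delta; it turns negative once delta exceeds k."""
    return k - delta


# Illustrative distances only.
sims = [inversion_similarity(d) for d in (0.0, 1.0, 10.0, 100.0)]
```

The inversion transform needs no bound on the largest distance in the collection, whereas the linear form needs k chosen at least that large to stay non-negative.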
4.4 Vector-Based Matching: Cosine Measure
The cosine measure is not a distance measure but an angular measure:
  $\sigma(D, Q) = \dfrac{\sum_k t_k q_k}{\sqrt{\sum_k t_k^2}\,\sqrt{\sum_k q_k^2}}$
where $t_k$ is the value of term $k$ in the document and $q_k$ is its value in the query.
This is the inner product of the document and query vectors, normalized by their lengths.
Measure comparison
Distance measures
  Similarity depends only on how far a given document is from the query point.
Angular measures
  They do not consider the distance of each document from the origin, only its direction.
  Two documents that lie along the same vector from the origin will be judged identical, despite the fact that they may be far apart in the document space.
Example: D1 = <1, 3>, D2 = <100, 300>, D3 = <3, 1>
  Cosine measure: (D1, D2) = 1.0, (D1, D3) = 0.6
  Euclidean distance: (D1, D2) = 313.07, (D1, D3) = 2.83
  The cosine measure judges D1 and D2 to be more similar, while the distance measure judges D1 and D3 to be more similar.
In practice, distance and angular measures seem to give results of similar quality sufficiently far from the origin.
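The comparison can be recomputed with a short sketch. The helper names `cosine` and `euclidean` are illustrative; the vectors are the D1, D2, D3 of the example (requires Python 3.8 or later for `math.dist` and multi-argument `math.hypot`).

```python
import math


def cosine(u, v):
    """Inner product of u and v, normalized by their Euclidean lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def euclidean(u, v):
    """Straight-line distance between the points u and v."""
    return math.dist(u, v)


D1, D2, D3 = (1, 3), (100, 300), (3, 1)

# Angular view: D2 points in the same direction as D1, D3 does not.
cos_12, cos_13 = cosine(D1, D2), cosine(D1, D3)

# Distance view: D3 sits next to D1, D2 is far away.
dist_12, dist_13 = euclidean(D1, D2), euclidean(D1, D3)
```

The two measures rank the pairs in opposite orders: the cosine is larger for (D1, D2), while the distance is smaller for (D1, D3).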
4.5 Missing Terms and Term Relationship
One problem: missing terms
  A 0 has two meanings: no occurrence, or no information about occurrence (e.g. <3, 0>, <0, 4>).
  A term may be missing from a document description because an indexer did not think it significant, rather than because it does not occur in the document. A term may likewise be missing from a query because the user left it out.
Another problem: term relationships
  Vector operations assume that the elements are independent of one another.
  This can produce wrong results, e.g. “digital computer”.
Final problem: symmetry
  Distance and angular measures both treat the query and the document symmetrically.
  The user wants documents that fit the query, not queries that fit the document.
  e.g. an encyclopedia: the entry that answers the user's query contains a great many words that do not appear in the query.
4.6 Probabilistic Matching
Focus attention on models that include uncertainties more directly.
  The aim is to calculate the probability that the document is relevant to the query.
Assumptions
  At any given time a single query is being used.
  The number of documents within the database that are relevant to the query is known.
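Probabilities when a document is selected at random
  P(rel) = n/N
  P(¬rel) = 1 - P(rel) = (N - n)/N
In practice, documents are selected by comparing the words of the query and the document.
  P(computer|digital) > P(computer|?)
Case 1
  For every document in some selected set S, the document is relevant if P(rel|selected) > P(¬rel|selected).
Discriminant function
  $\mathrm{dis}(\mathrm{selected}) = \dfrac{P(\mathrm{rel} \mid \mathrm{selected})}{P(\neg\mathrm{rel} \mid \mathrm{selected})}$
  If dis(selected) > 1 for every document in a set, that set is retrieved.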
Case 2
  Condition: the probability of relevance is more than 3 times the probability of nonrelevance.
    P(rel|selected) > 3 P(¬rel|selected)
    P(rel|selected) > 3 (1 - P(rel|selected))
    P(rel|selected) + 3 P(rel|selected) > 3
    P(rel|selected) > 0.75
  The discriminant function criterion is then dis(selected) > 3.
To judge the relevance of a single document, the formula above is applied at the level of individual ‘words’.
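Bayes's theorem
  $P(\mathrm{rel} \mid \mathrm{selected}) = \dfrac{P(\mathrm{selected} \mid \mathrm{rel})\,P(\mathrm{rel})}{P(\mathrm{selected})}$
Applying this to the discriminant function,
  $\mathrm{dis}(\mathrm{selected}) = \dfrac{P(\mathrm{selected} \mid \mathrm{rel})\,P(\mathrm{rel})}{P(\mathrm{selected} \mid \neg\mathrm{rel})\,P(\neg\mathrm{rel})}$
Assume that a document is represented by terms and that these terms are statistically independent:
  P(selected|rel) = P(t1|rel) P(t2|rel) ... P(tn|rel)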
If estimates for the probability of occurrence of various terms in relevant documents and in nonrelevant documents can be obtained, then the probability that a document will be retrieved can be estimated.
Example
  The proportion of relevant documents in the whole collection is 0.1.
  The discriminant value is less than 1, so the document is not retrieved.
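A sketch of the discriminant calculation under the term-independence assumption. Only the prior P(rel) = 0.1 comes from the example; the per-term probabilities are hypothetical placeholders, since the figures of the original example are not available here.

```python
from math import prod


def discriminant(p_terms_rel, p_terms_nonrel, p_rel):
    """dis = P(selected|rel) P(rel) / (P(selected|not rel) P(not rel)).

    Each conditional is factored into a product of per-term probabilities,
    which is valid only if the terms are statistically independent.
    """
    p_sel_rel = prod(p_terms_rel)
    p_sel_nonrel = prod(p_terms_nonrel)
    return (p_sel_rel * p_rel) / (p_sel_nonrel * (1.0 - p_rel))


p_rel = 0.1                  # proportion of relevant documents (from the example)
p_terms_rel = [0.8, 0.6]     # hypothetical P(t_i | rel)
p_terms_nonrel = [0.3, 0.4]  # hypothetical P(t_i | not rel)

dis = discriminant(p_terms_rel, p_terms_nonrel, p_rel)
retrieve = dis > 1  # Case 1 criterion; Case 2 would use dis > 3
```

With these placeholder values the terms favor relevance, but the low prior keeps the discriminant below 1, so the document would not be retrieved.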
4.7 Fuzzy Matching
Probabilistic matching involves much calculation and many assumptions.
In fuzzy matching the calculation is based on defined membership grades for terms.
This computation is simpler than that for probabilistic retrieval, since it involves simple functions of the membership grades for each document, based on fuzzy arithmetic.
  e.g. Avg(max(D1(t1), D2(t1), ...), max(D1(t2), D2(t2), ...))
An open question is how such terms translate into the membership functions associated with fuzzy retrieval.
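The example expression, an average over terms of the maximum membership grade across documents, can be evaluated with a few lines. The membership grades below are hypothetical, and treating a missing grade as 0.0 is an assumption of this sketch.

```python
def avg_of_max(grades, terms):
    """Average, over the given terms, of the highest membership grade any document has."""
    per_term_max = [
        max(doc_grades.get(term, 0.0) for doc_grades in grades.values())
        for term in terms
    ]
    return sum(per_term_max) / len(per_term_max)


# Hypothetical membership grades D(t) in [0, 1].
grades = {
    "D1": {"t1": 0.2, "t2": 0.9},
    "D2": {"t1": 0.7, "t2": 0.4},
}

value = avg_of_max(grades, ["t1", "t2"])
```

Only `max` and an average are involved, which illustrates why the computation is lighter than estimating per-term probabilities.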
4.8 Proximity Matching
A much older and more widely used matching method involves the proximity of terms in a text.
Proximity measures are frequently used as additional criteria to further refine the set of documents identified by one of the other matching methods.
Modifications of proximity criteria can increase their effectiveness, e.g. ordered proximity:
  “junior college” vs. “college junior”
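A small sketch of plain versus ordered proximity. The function name, the `max_gap` parameter, and the sample sentence are illustrative assumptions; real systems typically work from positional indexes rather than scanning token lists.

```python
def within_proximity(tokens, first, second, max_gap=1, ordered=False):
    """Return True if `first` and `second` occur within `max_gap` positions of each other.

    With ordered=True, `first` must also appear before `second`.
    """
    pos_first = [i for i, t in enumerate(tokens) if t == first]
    pos_second = [i for i, t in enumerate(tokens) if t == second]
    for i in pos_first:
        for j in pos_second:
            gap = j - i
            if ordered and 0 < gap <= max_gap:
                return True
            if not ordered and 0 < abs(gap) <= max_gap:
                return True
    return False


text = "she is a college junior this year".split()

# Unordered proximity cannot tell the two phrases apart.
unordered_hit = within_proximity(text, "junior", "college")

# Ordered proximity accepts "college junior" but rejects "junior college".
ordered_junior_college = within_proximity(text, "junior", "college", ordered=True)
ordered_college_junior = within_proximity(text, "college", "junior", ordered=True)
```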
4.9 Effects of Weighting
Not all terms are equally important in a query.
Weighting of terms modifies the calculations upon which relevance judgments are made.
Weighting can also be applied at a broader level than individual terms.
  e.g. (beef and broccoli): 5; (beef but not broccoli): 2; noodles: 1; snow peas: 1
Filtering without weighting: more complex calculations will be confined to a relatively small set of documents.
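The weighted-clause example can be sketched as a scoring function. Summing the weights of all satisfied clauses is an assumption of this sketch; the slides give the weights but not the rule for combining them.

```python
# Each entry pairs a predicate over a document's term set with its weight.
clauses = [
    (lambda d: "beef" in d and "broccoli" in d, 5),      # (beef and broccoli): 5
    (lambda d: "beef" in d and "broccoli" not in d, 2),  # (beef but not broccoli): 2
    (lambda d: "noodles" in d, 1),                       # noodles: 1
    (lambda d: "snow peas" in d, 1),                     # snow peas: 1
]


def weighted_score(doc_terms):
    """Assumed combination rule: add up the weights of every satisfied clause."""
    return sum(weight for predicate, weight in clauses if predicate(doc_terms))


# Hypothetical documents represented as term sets.
score_a = weighted_score({"beef", "broccoli", "noodles"})
score_b = weighted_score({"beef", "snow peas"})
score_c = weighted_score({"tofu"})
```

The first two clauses are mutually exclusive, so a document containing beef earns either 5 or 2 from them, never both.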
4.10 Effects of Scaling
The impact of the size of the document collection can be major.
  It determines whether it will be feasible to apply a technique to real document collections.
  False drops become more likely.
    False drops are documents that appear to match the query but are not appropriate.
    In a collection of computing documents, “object-oriented programming” is unlikely to produce false drops, but in a general collection it is likely to (even a TV is treated as an object).
Information filtering
  Produces a relatively small set containing a high proportion of relevant documents.
  A simple technique first extracts a small set of candidate documents, and a more complex technique then processes the extracted set, much like the refining of gold.
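The two-stage idea can be sketched as a cheap filter followed by a costlier ranking. The choice of stages (shared-term filter, then cosine over term frequencies), the function names, and the three sample documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def cheap_filter(docs, query_terms):
    """Stage 1: keep only documents sharing at least one term with the query."""
    wanted = set(query_terms)
    return {name: text for name, text in docs.items() if wanted & set(text.split())}


def cosine_score(text, query_terms):
    """Stage 2: cosine between term-frequency vectors (the costlier calculation)."""
    d, q = Counter(text.split()), Counter(query_terms)
    dot = sum(d[t] * q[t] for t in q)
    norm = math.sqrt(sum(v * v for v in d.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def two_stage_search(docs, query_terms):
    """Rank only the candidates that survive the cheap filter."""
    candidates = cheap_filter(docs, query_terms)
    return sorted(candidates, key=lambda n: cosine_score(candidates[n], query_terms), reverse=True)


docs = {
    "a": "object oriented programming in practice",
    "b": "the television is an object in the room",
    "c": "gardening tips for spring",
}
result = two_stage_search(docs, ["object", "oriented", "programming"])
```

Document "b" survives the cheap filter on the single word "object" but ranks below "a" in the second stage, which is where such a false drop gets demoted.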
4.11 Data Fusion
The observation that no single retrieval technique will work equally well in all situations has led to data fusion:
  the study of techniques for merging the results of multiple search techniques on multiple databases to produce the best possible response to a query.
  To develop a retrieval technique that can adapt: standardization of databases is the difficulty.
  To determine a method to combine results fairly: measures of differing character have to be combined.
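One common way to combine scores that live on different scales is to min-max normalize each result list and then sum the normalized scores per document (a CombSUM-style rule). The sketch below uses that rule as an assumption; the slides do not prescribe a fusion method, and the two score lists are invented.

```python
def min_max_normalize(scores):
    """Rescale one system's scores to [0, 1] so different measures become comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(result_lists):
    """Sum normalized scores per document and rank by the total."""
    fused = {}
    for scores in result_lists:
        for doc, s in min_max_normalize(scores).items():
            fused[doc] = fused.get(doc, 0.0) + s
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical outputs of two systems; larger means better in both.
cosine_run = {"d1": 0.9, "d2": 0.5, "d3": 0.1}
other_run = {"d2": 40.0, "d3": 30.0, "d4": 10.0}

ranking = fuse([cosine_run, other_run])
```

A document missing from one list simply contributes nothing for that system, which is itself a design choice that favors documents found by several systems.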
4.12 A User-Centered View
Each user has an individual vocabulary.
  Retrieval systems commonly miss some documents that might have been informative to the user, and retrieve others that the user does not find helpful.