Download - Automatic Indexing (Term Selection)
![Page 1: Automatic Indexing (Term Selection)](https://reader036.vdocuments.mx/reader036/viewer/2022062314/5681301b550346895d9596e5/html5/thumbnails/1.jpg)
Automatic Indexing (Term Selection)
Automatic Text Processingby G. Salton, Chap 9, Addison-Wesley, 1989.
![Page 2: Automatic Indexing (Term Selection)](https://reader036.vdocuments.mx/reader036/viewer/2022062314/5681301b550346895d9596e5/html5/thumbnails/2.jpg)
Automatic Indexing
Indexing: assign identifiers (index terms) to text documents.
Identifiers: single-term vs. term phrase controlled vs. uncontrolled vocabularies
instruction manuals, terminological schedules, … objective vs. nonobjective text identifiers
cataloging rules define, e.g., author names, publisher names, dates of publications, …
![Page 3: Automatic Indexing (Term Selection)](https://reader036.vdocuments.mx/reader036/viewer/2022062314/5681301b550346895d9596e5/html5/thumbnails/3.jpg)
Two Issues
Issue 1: indexing exhaustivity exhaustive: assign a large number of terms nonexhaustive
Issue 2: term specificity broad terms (generic)
cannot distinguish relevant from nonrelevant documents narrow terms (specific)
retrieve relatively fewer documents, but most of them are relevant
![Page 4: Automatic Indexing (Term Selection)](https://reader036.vdocuments.mx/reader036/viewer/2022062314/5681301b550346895d9596e5/html5/thumbnails/4.jpg)
Term-Frequency Consideration Function words
for example, "and", "or", "of", "but", … the frequencies of these words are high in all texts
Content words words that actually relate to document content varying frequencies in the different texts of a collect indicate term importance for content
![Page 5: Automatic Indexing (Term Selection)](https://reader036.vdocuments.mx/reader036/viewer/2022062314/5681301b550346895d9596e5/html5/thumbnails/5.jpg)
A Frequency-Based Indexing Method
Eliminate common function words from the document texts by consulting a special dictionary, or stop list, containing a list of high frequency function words.
Compute the term frequency tfij for all remaining terms Tj in each document Di, specifying the number of occurrences of Tj in Di.
Choose a threshold frequency T, and assign to each document Di all term Tj for which tfij > T.
![Page 6: Automatic Indexing (Term Selection)](https://reader036.vdocuments.mx/reader036/viewer/2022062314/5681301b550346895d9596e5/html5/thumbnails/6.jpg)
How to compute wij ? Inverse document frequency, idfj
tfij*idfj (TFxIDF) Term discrimination value, dvj
tfij*dvj
Probabilistic term weighting trj tfij*trj
Global properties of terms in a document collection
![Page 7: Automatic Indexing (Term Selection)](https://reader036.vdocuments.mx/reader036/viewer/2022062314/5681301b550346895d9596e5/html5/thumbnails/7.jpg)
Inverse Document Frequency
Inverse Document Frequency (IDF) for term Tj
where dfj (document frequency of term Tj) is thenumber of documents in which Tj occurs.
fulfil both the recall and the precision occur frequently in individual documents but rarely in the re
mainder of the collection
idfN
dfj
j
log
![Page 8: Automatic Indexing (Term Selection)](https://reader036.vdocuments.mx/reader036/viewer/2022062314/5681301b550346895d9596e5/html5/thumbnails/8.jpg)
TFxIDF Weight wij of a term Tj in a document di
Eliminating common function words Computing the value of wij for each term Tj in each document Di
Assigning to the documents of a collection all terms with sufficiently high (tf x idf) factors
w tfN
dfij ij
j
log
![Page 9: Automatic Indexing (Term Selection)](https://reader036.vdocuments.mx/reader036/viewer/2022062314/5681301b550346895d9596e5/html5/thumbnails/9.jpg)
Term-discrimination Value
Useful index terms Distinguish the documents of a collection from
each other Document Space
Two documents are assigned very similar term sets, when the corresponding points in document configuration appear close together
When a high-frequency term without discrimination is assigned, it will increase the document space density
![Page 10: Automatic Indexing (Term Selection)](https://reader036.vdocuments.mx/reader036/viewer/2022062314/5681301b550346895d9596e5/html5/thumbnails/10.jpg)
Original State After Assignment of good discriminator
After Assignment of poor discriminator
A Virtual Document Space
![Page 11: Automatic Indexing (Term Selection)](https://reader036.vdocuments.mx/reader036/viewer/2022062314/5681301b550346895d9596e5/html5/thumbnails/11.jpg)
Good Term Assignment
When a term is assigned to the documents of a collection, the few objects to which the term is assigned will be distinguished from the rest of the collection.
This should increase the average distance between the objects in the collection and hence produce a document space less dense than before.
![Page 12: Automatic Indexing (Term Selection)](https://reader036.vdocuments.mx/reader036/viewer/2022062314/5681301b550346895d9596e5/html5/thumbnails/12.jpg)
Poor Term Assignment
A high frequency term is assigned that does not discriminate between the objects of a collection. Its assignment will render the document more similar.
This is reflected in an increase in document space density.
![Page 13: Automatic Indexing (Term Selection)](https://reader036.vdocuments.mx/reader036/viewer/2022062314/5681301b550346895d9596e5/html5/thumbnails/13.jpg)
Term Discrimination Value
Definitiondvj = Q - Qj
where Q and Qj are space densities before and after the assignments of term Tj.
dvj>0, Tj is a good term; dvj<0, Tj is a poor term.
QN N
sim D Di kki k
N
i
N
1
1 11( )( , )
![Page 14: Automatic Indexing (Term Selection)](https://reader036.vdocuments.mx/reader036/viewer/2022062314/5681301b550346895d9596e5/html5/thumbnails/14.jpg)
DocumentFrequency
Low frequency
dvj=0Medium frequency
dvj>0
High frequency
dvj<0
N
Variations of Term-Discrimination Valuewith Document Frequency
![Page 15: Automatic Indexing (Term Selection)](https://reader036.vdocuments.mx/reader036/viewer/2022062314/5681301b550346895d9596e5/html5/thumbnails/15.jpg)
TFij x dvj
wij = tfij x dvj
compared with
: decrease steadily with increasing document frequency
dvj: increase from zero to positive as the document frequency of the term increase,
decrease shapely as the document frequency becomes still larger.
w tfN
dfij ij
j
log
N
df j
![Page 16: Automatic Indexing (Term Selection)](https://reader036.vdocuments.mx/reader036/viewer/2022062314/5681301b550346895d9596e5/html5/thumbnails/16.jpg)
Document Centroid Issue: efficiency problem
N(N-1) pairwise similarities Document centroid C = (c1, c2, c3, ..., ct)
where wij is the j-th term in document i. Space density
N
iijj wc
1
N
iiDCsim
NQ
1
),(1
![Page 17: Automatic Indexing (Term Selection)](https://reader036.vdocuments.mx/reader036/viewer/2022062314/5681301b550346895d9596e5/html5/thumbnails/17.jpg)
Probabilistic Term Weighting
GoalExplicit distinctions between occurrences of terms in relevant and nonrelevant documents of a collection
DefinitionGiven a user query q, and the ideal answer set of the relevant documents
From decision theory, the best ranking algorithm for a document D
)Pr(
)Pr(log
)|Pr(
)|Pr(log)(
nonrel
rel
nonrelD
relDDg
![Page 18: Automatic Indexing (Term Selection)](https://reader036.vdocuments.mx/reader036/viewer/2022062314/5681301b550346895d9596e5/html5/thumbnails/18.jpg)
Probabilistic Term Weighting
Pr(rel), Pr(nonrel):document’s a priori probabilities of relevance and nonrelevance
Pr(D|rel), Pr(D|nonrel):occurrence probabilities of document D in the relevant and nonrelevant document sets
![Page 19: Automatic Indexing (Term Selection)](https://reader036.vdocuments.mx/reader036/viewer/2022062314/5681301b550346895d9596e5/html5/thumbnails/19.jpg)
t
ii
t
ii
nonrelxnonrelD
relxrelD
1
1
)|Pr()|Pr(
)|Pr()|Pr(
Assumptions
Terms occur independently in documents
![Page 20: Automatic Indexing (Term Selection)](https://reader036.vdocuments.mx/reader036/viewer/2022062314/5681301b550346895d9596e5/html5/thumbnails/20.jpg)
Derivation Process
)Pr(
)Pr(log
)|Pr(
)|Pr(log)(
nonrel
rel
nonrelD
relDDg
log
Pr( | )
Pr( | )
x rel
x nonrel
ii
t
ii
t1
1
constants
log
Pr( | )
Pr( | )
x rel
x nonreli
ii
t
1
constants
![Page 21: Automatic Indexing (Term Selection)](https://reader036.vdocuments.mx/reader036/viewer/2022062314/5681301b550346895d9596e5/html5/thumbnails/21.jpg)
Given a document D=(d1, d2, …, dt)
Assume di is either 0 (absent) or 1 (present).
Pr( | ) ( )
Pr( | ) ( )
x d rel p p
x d nonrel q q
i i i
d
i
d
i i i
d
i
d
i i
i i
1
1
1
1
Pr(xi=1|rel) = pi Pr(xi=0|rel) = 1-piPr(xi=1|nonrel) = qi Pr(xi=0|nonrel) = 1-qi
g Dx d rel
x d nonreli i
i ii
t
( ) logPr( | )
Pr( | )
1
constants
For a specific document D
![Page 22: Automatic Indexing (Term Selection)](https://reader036.vdocuments.mx/reader036/viewer/2022062314/5681301b550346895d9596e5/html5/thumbnails/22.jpg)
g Dx d rel
x d nonreli i
i ii
t
( ) logPr( | )
Pr( | )
1
constants
log( )
( )
d d
d d
i i
i i
p p
q q
i i
i ii
t1
11
11
constants
log( ) ( )
( ) ( )
d d
d d
i i
i i
p q p
q p qi i i
i i ii
t 1 1
1 11
constants
constantslog1 )1())1((
)1())1((
t
iiii
iii
qpq
pqpi
i
d
d
![Page 23: Automatic Indexing (Term Selection)](https://reader036.vdocuments.mx/reader036/viewer/2022062314/5681301b550346895d9596e5/html5/thumbnails/23.jpg)
trp q
q pj
j j
j j
log( )
( )
1
1
g Dp
qd
p q
q pi
ii
t
ii i
i ii
t
( ) log log( )
( )
1
1
1
11 1constants
Term Relevance Weight
![Page 24: Automatic Indexing (Term Selection)](https://reader036.vdocuments.mx/reader036/viewer/2022062314/5681301b550346895d9596e5/html5/thumbnails/24.jpg)
Issue
How to compute pj and qj ?
pj = rj / Rqj = (dfj-rj)/(N-R)
R: the total number of relevant documents N: the total number of documents
![Page 25: Automatic Indexing (Term Selection)](https://reader036.vdocuments.mx/reader036/viewer/2022062314/5681301b550346895d9596e5/html5/thumbnails/25.jpg)
Estimation of Term-Relevance
The occurrence probability of a term in the nonrelevant documents qj is approximated by the occurrence probability of the term in the entire document collection
qj = dfj / N
The occurrence probabilities of the terms in the small number of relevant documents is equal by using a constant value pj = 0.5 for all j.
![Page 26: Automatic Indexing (Term Selection)](https://reader036.vdocuments.mx/reader036/viewer/2022062314/5681301b550346895d9596e5/html5/thumbnails/26.jpg)
5.0*
)1(*5.0log
)1(
)1(log
N
dfN
df
pq
qptr
j
j
jj
jjj
j
j
df
dfN )(log
When N is sufficiently large, N-dfj N,
j
jj
df
dfNtr
)(log
jdf
Nlog = idfj
Comparison