Automatic Indexing — Hsin-Hsi Chen (slide transcript)


  • Slide 1
  • Salton2-1 Automatic Indexing Hsin-Hsi Chen
  • Slide 2
  • Salton2-2 Indexing. Indexing: assign identifiers to text items. Assign: manual vs. automatic indexing. Identifiers: objective vs. nonobjective text identifiers (cataloging rules define the objective ones, e.g., author names, publisher names, dates of publication); controlled vs. uncontrolled vocabularies (instruction manuals, terminological schedules); single terms vs. term phrases.
  • Slide 3
  • Salton2-3 Two Issues. Issue 1: indexing exhaustivity. Exhaustive indexing assigns a large number of terms; nonexhaustive indexing assigns few. Issue 2: term specificity. Broad (generic) terms cannot distinguish relevant from nonrelevant items; narrow (specific) terms retrieve relatively fewer items, but most of them are relevant.
  • Slide 4
  • Salton2-4 Parameters of retrieval effectiveness. Recall: the fraction of the relevant items that are retrieved. Precision: the fraction of the retrieved items that are relevant. Goal: high recall and high precision.
  • Slide 5
  • Salton2-5 Retrieval contingency table: the retrieved part of the collection splits into a (relevant) and b (nonrelevant) items; the non-retrieved part splits into c (relevant) and d (nonrelevant). Hence Recall = a / (a + c) and Precision = a / (a + b).
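As a minimal illustration (a hypothetical helper, not from the slides), the two measures computed directly from the table cells:

```python
def recall_precision(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Recall and precision from the retrieval contingency table.

    a: retrieved & relevant       b: retrieved & nonrelevant
    c: not retrieved & relevant   d: not retrieved & nonrelevant
    (assumes at least one relevant and one retrieved item)
    """
    recall = a / (a + c)      # fraction of relevant items that were retrieved
    precision = a / (a + b)   # fraction of retrieved items that are relevant
    return recall, precision
```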
  • Slide 6
  • Salton2-6 A Joint Measure. F-score: F = ((β²+1)·P·R) / (P + β²·R), where β is a parameter that encodes the relative importance of recall and precision. β = 1: equal weight; β > 1: precision is more important.
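A small sketch of the joint measure, following the weighting convention stated on this slide (note that many modern texts use the mirrored convention in which β > 1 favors recall):

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-score as on the slide: beta = 1 weights recall and precision
    equally; beta > 1 makes precision more important."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1.0) * precision * recall / (precision + b2 * recall)
```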
  • Salton2-10 A Frequency-Based Indexing Method. Eliminate common function words from the document texts by consulting a special dictionary, or stop list, containing a list of high-frequency function words. Compute the term frequency tf_ij for all remaining terms T_j in each document D_i, specifying the number of occurrences of T_j in D_i. Choose a threshold frequency T, and assign to each document D_i all terms T_j for which tf_ij > T.
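A minimal sketch of this method (the stop list shown is illustrative, not the slides'):

```python
from collections import Counter

STOP_LIST = {"the", "of", "and", "a", "in", "to", "is"}  # illustrative only

def frequency_index(documents: list[str], threshold: int) -> list[dict[str, int]]:
    """Assign to each document D_i all terms T_j with tf_ij > threshold,
    after stop-list removal, per the frequency-based indexing method."""
    index = []
    for text in documents:
        terms = [w for w in text.lower().split() if w not in STOP_LIST]
        tf = Counter(terms)  # tf_ij: occurrences of T_j in D_i
        index.append({t: f for t, f in tf.items() if f > threshold})
    return index
```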
  • Slide 11
  • Salton2-11 Discussions. High-frequency terms favor recall. High precision requires the ability to distinguish individual documents from each other; a high-frequency term is good for precision only when its term frequency is not equally high in all documents.
  • Slide 12
  • Salton2-12 Inverse Document Frequency. Inverse document frequency (IDF) for term T_j: idf_j = log(N / df_j), where df_j (the document frequency of term T_j) is the number of documents in which T_j occurs and N is the number of documents in the collection. Terms that occur frequently in individual documents but rarely in the remainder of the collection fulfil both the recall and the precision goals.
  • Slide 13
  • Salton2-13 New Term Importance Indicator. The weight w_ij of a term T_j in a document D_i: w_ij = tf_ij × log(N / df_j). Procedure: eliminate common function words; compute the value of w_ij for each term T_j in each document D_i; assign to the documents of a collection all terms with sufficiently high (tf × idf) factors.
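A sketch of the tf × idf weighting as defined above, assuming the input documents are already tokenized and stop-word filtered:

```python
import math
from collections import Counter

def tf_idf_weights(documents: list[list[str]]) -> list[dict[str, float]]:
    """Compute w_ij = tf_ij * log(N / df_j) for every term in every document."""
    n = len(documents)
    df = Counter()                 # df_j: number of documents containing T_j
    for doc in documents:
        df.update(set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)          # tf_ij: occurrences of T_j in D_i
        weights.append({t: f * math.log(n / df[t]) for t, f in tf.items()})
    return weights
```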
  • Slide 14
  • Salton2-14 Term-Discrimination Value. Useful index terms distinguish the documents of a collection from each other. Document space: when two documents are assigned very similar term sets, the corresponding points in the document configuration appear close together; when a high-frequency term without discrimination power is assigned, it increases the document space density.
  • Slide 15
  • Salton2-15 A Virtual Document Space. (Figure: the document space in its original state, after assignment of a good discriminator, and after assignment of a poor discriminator.)
  • Slide 16
  • Salton2-16 Good Term Assignment When a term is assigned to the documents of a collection, the few items to which the term is assigned will be distinguished from the rest of the collection. This should increase the average distance between the items in the collection and hence produce a document space less dense than before.
  • Slide 17
  • Salton2-17 Poor Term Assignment A high frequency term is assigned that does not discriminate between the items of a collection. Its assignment will render the documents more similar. This is reflected in an increase in document space density.
  • Slide 18
  • Salton2-18 Term Discrimination Value definition: dv_j = Q − Q_j, where Q and Q_j are the space densities before and after the assignment of term T_j. dv_j > 0: T_j is a good term; dv_j < 0: T_j is a poor term.
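A sketch of the definition, using one common operationalization of space density (average similarity of each document to the collection centroid; this particular density choice is an assumption, not stated on the slides):

```python
import math

def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def density(docs: list[dict[str, float]]) -> float:
    """Space density Q: average similarity of each document to the centroid."""
    centroid: dict[str, float] = {}
    for d in docs:
        for t, w in d.items():
            centroid[t] = centroid.get(t, 0.0) + w / len(docs)
    return sum(cosine(d, centroid) for d in docs) / len(docs)

def discrimination_value(docs: list[dict[str, float]], term: str) -> float:
    """dv_j = Q - Q_j: density without term T_j minus density with it.
    dv_j > 0 means T_j spreads the space apart (a good discriminator)."""
    without = [{t: w for t, w in d.items() if t != term} for d in docs]
    return density(without) - density(docs)
```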
  • Salton2-19 Document Frequency. Low frequency: dv_j = 0. Medium frequency: dv_j > 0. High frequency: dv_j < 0.
  • Salton2-41 Term-Phrase Formation. Term phrase: a sequence of related text words that carries a more specific meaning than the single terms, e.g., "computer science" vs. "computer". Phrase formation targets the high-frequency terms (dv_j < 0) in the frequency spectrum above.
  • Salton2-52 Thesaurus-Group Generation. Thesaurus transformation broadens index terms whose scope is too narrow to be useful in retrieval: a thesaurus must assemble groups of related specific terms under more general, higher-level class indicators. It targets the low-frequency terms (dv_j = 0) in the frequency spectrum above.
  • Salton2-57 Word Stemming. effectiveness --> effective --> effect; picnicking --> picnic; but king -/-> k (suffix removal must not over-strip).
  • Slide 58
  • Salton2-58 Some Morphological Rules Restore a silent e after suffix removal from certain words to produce hope from hoping rather than hop. Delete certain doubled consonants after suffix removal, so as to generate hop from hopping rather than hopp. Use a final y for an i in forms such as easier, so as to generate easy instead of easi.
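A simplified sketch of these three repair rules applied after a suffix has been stripped. The condition used for restoring the silent e (a short consonant-vowel-consonant stem) is my assumption; real stemmers such as Porter's use fuller condition lists:

```python
def repair_stem(stem: str, suffix: str) -> str:
    """Apply the slide's three morphological repair rules to a stripped stem."""
    # Rule 3: restore final y for i (easier -> easi -> easy).
    if stem.endswith("i"):
        return stem[:-1] + "y"
    # Rule 2: undo doubled consonants (hopping -> hopp -> hop).
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
        return stem[:-1]
    # Rule 1: restore a silent e (hoping -> hop -> hope) for short stems
    # ending consonant-vowel-consonant (condition is an assumption).
    if (suffix == "ing" and len(stem) == 3
            and stem[-1] not in "aeiou" and stem[-2] in "aeiou"):
        return stem + "e"
    return stem
```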
  • Slide 59
  • Salton2-59 The Indexing Prescription (2) Identify individual text words. Use stop list to delete common function words. Use automatic suffix stripping to produce word stems. Compute term-discrimination value for all word stems. Use thesaurus class replacement for all low-frequency terms with discrimination values near zero. Use phrase-formation process for all high-frequency terms with negative discrimination values. Compute weighting factors for complex indexing units. Assign to each document single term weights, term phrases, and thesaurus classes with weights.
  • Slide 60
  • Salton2-60 Query vs. Document Differences. Query texts are short; fewer terms are assigned to queries; the occurrence frequency of a query term rarely exceeds 1. Q = (w_q1, w_q2, ..., w_qt), where w_qj is an inverse-document-frequency weight. D_i = (d_i1, d_i2, ..., d_it), where d_ij is a term-frequency × inverse-document-frequency weight.
  • Slide 61
  • Salton2-61 Query vs. Document When non-normalized document vectors are used, longer documents with more assigned terms have a greater chance of matching particular query terms than shorter document vectors do; length normalization of the query-document similarity compensates for this.
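The slide's normalization formulas did not survive transcription; a sketch of the standard cosine-normalized match, which I assume is what was shown:

```python
import math

def cosine_similarity(query: dict[str, float], doc: dict[str, float]) -> float:
    """Length-normalized similarity between Q and D_i, so long documents
    with many assigned terms do not dominate purely by size."""
    dot = sum(w * doc.get(t, 0.0) for t, w in query.items())
    norm_q = math.sqrt(sum(w * w for w in query.values()))
    norm_d = math.sqrt(sum(w * w for w in doc.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0
```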
  • Slide 62
  • Salton2-62 Relevance Feedback Terms present in previously retrieved documents that have been identified as relevant to the user's query are added to the original formulations. The weights of the original query terms are altered by replacing the inverse document frequency portion of the weights with term-relevance weights obtained by using the occurrence characteristics of the terms in the previously retrieved relevant and nonrelevant documents of the collection.
  • Slide 63
  • Salton2-63 Relevance Feedback. Q = (w_q1, w_q2, ..., w_qt), D_i = (d_i1, d_i2, ..., d_it). The new query may take the form Q' = (w_q1, w_q2, ..., w_qt, w_qt+1, w_qt+2, ..., w_qt+m). The weights of the newly added terms T_t+1 to T_t+m may consist of a combined term-frequency and term-relevance weight.
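A sketch of the expansion step, adding the m highest-weighted terms from the retrieved relevant documents to the query. The simple averaging used to weight the new terms is my choice, not the slides' exact term-relevance formula:

```python
def expand_query(query: dict[str, float],
                 relevant_docs: list[dict[str, float]],
                 m: int = 10) -> dict[str, float]:
    """Add terms T_{t+1}..T_{t+m} from relevant documents to the query."""
    new_q = dict(query)
    pooled: dict[str, float] = {}          # average weight across relevant docs
    for doc in relevant_docs:
        for term, w in doc.items():
            pooled[term] = pooled.get(term, 0.0) + w / len(relevant_docs)
    candidates = sorted((t for t in pooled if t not in query),
                        key=lambda t: pooled[t], reverse=True)
    for term in candidates[:m]:
        new_q[term] = pooled[term]
    return new_q
```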
  • Salton2-71 Experiment Design. CLARIT commercial retrieval system: {original document set} --> CLARIT NP Extractor --> {Raw Noun Phrases} --> Statistical NP Parser, Phrase Extractor --> {Indexing Term Set} --> CLARIT Retrieval Engine.
  • Slide 72
  • Salton2-72 Different Indexing Units. Example: [[[heavy construction] industry] group] (WSJ90). Single words: heavy, construction, industry, group. Head-modifier pairs: heavy construction, construction industry, industry group. Full noun phrase: heavy construction industry group.
  • Slide 73
  • Salton2-73 Different Indexing Units (Continued). WD-SET: single words only (no phrases; baseline). WD-HM-SET: single words + head-modifier pairs. WD-NP-SET: single words + full NPs. WD-HM-NP-SET: single words + head-modifier pairs + full NPs.
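A sketch deriving the three kinds of indexing units from the bracketed WSJ90 example above, assuming each bracket pairs a modifier subtree with a head and that the head word is the rightmost word (a simplification for English NPs):

```python
# [[[heavy construction] industry] group] as a left-branching nested list:
NP = [[["heavy", "construction"], "industry"], "group"]

def words(np):
    """All words of a subtree, left to right (also the full-NP unit)."""
    return [np] if isinstance(np, str) else [w for part in np for w in words(part)]

def head(np):
    """Head word of a subtree: its rightmost word."""
    return words(np)[-1]

def hm_pairs(np):
    """One head-modifier pair per bracket: (modifier's head, phrase head)."""
    if isinstance(np, str):
        return []
    modifier, h = np
    return hm_pairs(modifier) + [(head(modifier), head(h))]

print(words(NP))             # ['heavy', 'construction', 'industry', 'group']
print(hm_pairs(NP))          # [('heavy', 'construction'), ('construction', 'industry'), ('industry', 'group')]
print(" ".join(words(NP)))   # 'heavy construction industry group'
```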
  • Slide 74
  • Salton2-74 Result Analysis. Collection: Tipster Disk 2 (250 MB). Queries: TREC-5 ad hoc topics (251-300). Relevance feedback: top 10 documents returned from the initial retrieval. Evaluation: total number of relevant documents retrieved; highest level of precision over all points of recall; average precision.
  • Slide 75
  • Salton2-75 Effects of phrases with feedback on TREC-5. (Results figure.)
  • Slide 76
  • Salton2-76 Summary. When only one kind of phrase is used to supplement the single words, each can lead to a great improvement in precision. When we combine the two kinds of phrases, the effect is a greater improvement in recall than in precision. How to combine and weight different phrases effectively becomes an important issue.
  • Slide 77
  • Salton2-77 A Corpus-Based Statistical Approach to Automatic Book Indexing Jyun-Sheng Chang, Tsung-Yih Tseng, Ying Cheng, Huey-Chyun Chen, Shun-Der Cheng, Sur-Jin Ker, and John S. Liu (ANLP92, pp. 147-151)
  • Slide 78
  • Salton2-78 Generating Indices Word Segmentation Part-of-speech tagging Finding noun phrases
  • Slide 79
  • Salton2-79 Example of Problem Description: segmentation, tagging, noun phrase finding. (The Chinese example sentences were not preserved in this transcript; their tag sequences were: P/Q/CL/LOC/CTM/NC/NC; P/D/Q/CL/NC/LOC/LOC/LOC/V/CTM/NC; NP/ADV/V/NC/CTM/NC; P/NC/CTM/NC/CTM/NC/NC.)
  • Slide 80
  • Salton2-80 Word Segmentation. Given a Chinese sentence, segment the sentence into words. (The Chinese example was not preserved in this transcript.)
  • Slide 81
  • Salton2-81 Segmentation as a Constraint Satisfaction Problem. Given a sequence of Chinese characters C1, C2, ..., Cn, assign break/continue to each place Xi between two adjacent characters Ci and Ci+1 (break: >, continue: =). (The example characters were not preserved; their label sequence was > = = > > = > = > > > =.)
  • Slide 82
  • Salton2-82 Detailed Specification. For each sequence of characters Ci, ..., Cj which is a Chinese word in the dictionary or a surname-name: if j = i, put (>, >) in Ki-1,i. If j > i, put (>, =) in Ki-1,i, (=, =) in Ki,i+1 through Kj-2,j-1, and (=, >) in Kj-1,j.
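A sketch of this constraint-table construction, collecting candidate (left, right) label pairs per character position from dictionary matches; the data layout is my choice for illustration:

```python
def constraint_pairs(n: int, matches: list[tuple[int, int]]) -> dict[int, set]:
    """Candidate break/continue label pairs per character position.

    n: sentence length; matches: dictionary words as (i, j) spans over
    C_i..C_j (1-based, inclusive). '>' = break, '=' = continue.
    K[p] holds candidate pairs (X_{p-1}, X_p) around character C_p.
    """
    K: dict[int, set] = {p: set() for p in range(1, n + 1)}
    for i, j in matches:
        if i == j:                    # single-character word: break on both sides
            K[i].add((">", ">"))
        else:                         # multi-character word C_i..C_j
            K[i].add((">", "="))      # word start: break before, continue after C_i
            for p in range(i + 1, j):
                K[p].add(("=", "="))  # interior characters: continue on both sides
            K[j].add(("=", ">"))      # word end: continue before, break after C_j
    return K
```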
  • Slide 83
  • Salton2-83 Example. Character positions 0-13; the words found by dictionary lookup (the Chinese characters were not preserved in this transcript) generate the following candidate label pairs per position:
    0: (>,>)
    1: (>,=)
    2: (=,>), (=,=)
    3: (=,>)
    4: (>,>)
    5: (>,>)
    6: (>,>), (>,=)
    7: (>,>), (=,>), (>,=)
    8: (>,>), (=,>), (>,=)
    9: (>,>), (=,>), (>,=)
    10: (>,>), (=,>), (>,=)
    11: (>,>)
    12: (>,>), (>,=)
    13: (=,>)
  • Slide 84
  • Salton2-84 Differences from English IR. Data analysis issues. media: syllable structure in speech data. code & character: GB and BIG5 code conversion; individual characters carry semantics of their own. word: word stemming and spelling are simple, but word segmentation and proper-noun identification are hard. (Dr. L.F. Chien, 1996)
  • Slide 85
  • Salton2-85 Differences from English IR (Continued). Interface issues. input: eager demand for speech and OCR input. query: need searching for approximate terms; rigid information in natural-language queries (NLQ); hard to find proper nouns in NLQ.
  • Slide 86
  • Salton2-86 Differences from English IR (Continued) Indexing and searching issues index: hard to use word-level and complete index like inverted file workable to use character-level and filtering index like signature (Chien, 1995) searching: need multiple-stage searching, need best match in term level
  • Slide 87
  • Salton2-87 Segmentation Problem Segmentation is a serious problem in processing Chinese sentences (Hsin-Hsi Chen, 1996)
  • Slide 88
  • Salton2-88 Strategies. Dictionary lookup supplemented by other special strategies: the-longest-word-first; the number of words; ...
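A sketch of the longest-word-first strategy named above: greedy dictionary lookup that prefers the longest match at each position, falling back to a single character (the maximum word length and fallback behavior are assumptions for illustration):

```python
def longest_match_segment(text: str, dictionary: set[str],
                          max_len: int = 4) -> list[str]:
    """Greedy longest-word-first segmentation over a character string."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words
```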