Automatic Indexing — Hsin-Hsi Chen (slide transcript)


  • Slide 1
  • Salton2-1 Automatic Indexing Hsin-Hsi Chen
  • Slide 2
  • Salton2-2 Indexing. Indexing: assign identifiers to text items. Assign: manual vs. automatic indexing. Identifiers: objective vs. nonobjective text identifiers (cataloging rules define the objective ones, e.g., author names, publisher names, dates of publication); controlled vs. uncontrolled vocabularies (instruction manuals, terminological schedules); single terms vs. term phrases.
  • Slide 3
  • Salton2-3 Two Issues. Issue 1: indexing exhaustivity. Exhaustive indexing assigns a large number of terms; nonexhaustive indexing assigns few. Issue 2: term specificity. Broad (generic) terms cannot distinguish relevant from nonrelevant items; narrow (specific) terms retrieve relatively fewer items, but most of them are relevant.
  • Slide 4
  • Salton2-4 Parameters of retrieval effectiveness. Recall: the fraction of the relevant items that are retrieved. Precision: the fraction of the retrieved items that are relevant. Goal: high recall and high precision.
  • Slide 5
  • Salton2-5 Retrieval contingency table: the retrieved part of the collection splits into a (relevant) and b (nonrelevant) items; the non-retrieved part splits into c (relevant) and d (nonrelevant). Hence Recall = a / (a + c) and Precision = a / (a + b).
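As a minimal illustration (a hypothetical helper, not from the slides), the two measures computed directly from the table cells:

```python
def recall_precision(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Recall and precision from the retrieval contingency table.

    a: retrieved & relevant       b: retrieved & nonrelevant
    c: not retrieved & relevant   d: not retrieved & nonrelevant
    (assumes at least one relevant and one retrieved item)
    """
    recall = a / (a + c)      # fraction of relevant items that were retrieved
    precision = a / (a + b)   # fraction of retrieved items that are relevant
    return recall, precision
```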
  • Slide 6
  • Salton2-6 A Joint Measure. F-score: F = ((β²+1)·P·R) / (P + β²·R), where β is a parameter that encodes the relative importance of recall and precision. β = 1: equal weight; β > 1: precision is more important.
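A small sketch of the joint measure, following the weighting convention stated on this slide (note that many modern texts use the mirrored convention in which β > 1 favors recall):

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-score as on the slide: beta = 1 weights recall and precision
    equally; beta > 1 makes precision more important."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1.0) * precision * recall / (precision + b2 * recall)
```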
  • Salton2-10 A Frequency-Based Indexing Method. Eliminate common function words from the document texts by consulting a special dictionary, or stop list, containing a list of high-frequency function words. Compute the term frequency tf_ij for all remaining terms T_j in each document D_i, specifying the number of occurrences of T_j in D_i. Choose a threshold frequency T, and assign to each document D_i all terms T_j for which tf_ij > T.
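A minimal sketch of this method (the stop list shown is illustrative, not the slides'):

```python
from collections import Counter

STOP_LIST = {"the", "of", "and", "a", "in", "to", "is"}  # illustrative only

def frequency_index(documents: list[str], threshold: int) -> list[dict[str, int]]:
    """Assign to each document D_i all terms T_j with tf_ij > threshold,
    after stop-list removal, per the frequency-based indexing method."""
    index = []
    for text in documents:
        terms = [w for w in text.lower().split() if w not in STOP_LIST]
        tf = Counter(terms)  # tf_ij: occurrences of T_j in D_i
        index.append({t: f for t, f in tf.items() if f > threshold})
    return index
```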
  • Slide 11
  • Salton2-11 Discussions. High-frequency terms favor recall. High precision requires the ability to distinguish individual documents from each other; a high-frequency term is good for precision only when its term frequency is not equally high in all documents.
  • Slide 12
  • Salton2-12 Inverse Document Frequency. Inverse document frequency (IDF) for term T_j: idf_j = log(N / df_j), where df_j (the document frequency of term T_j) is the number of documents in which T_j occurs and N is the number of documents in the collection. Terms that occur frequently in individual documents but rarely in the remainder of the collection fulfil both the recall and the precision goals.
  • Slide 13
  • Salton2-13 New Term Importance Indicator. The weight w_ij of a term T_j in a document D_i: w_ij = tf_ij × log(N / df_j). Procedure: eliminate common function words; compute the value of w_ij for each term T_j in each document D_i; assign to the documents of a collection all terms with sufficiently high (tf × idf) factors.
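A sketch of the tf × idf weighting as defined above, assuming the input documents are already tokenized and stop-word filtered:

```python
import math
from collections import Counter

def tf_idf_weights(documents: list[list[str]]) -> list[dict[str, float]]:
    """Compute w_ij = tf_ij * log(N / df_j) for every term in every document."""
    n = len(documents)
    df = Counter()                 # df_j: number of documents containing T_j
    for doc in documents:
        df.update(set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)          # tf_ij: occurrences of T_j in D_i
        weights.append({t: f * math.log(n / df[t]) for t, f in tf.items()})
    return weights
```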
  • Slide 14
  • Salton2-14 Term-Discrimination Value. Useful index terms distinguish the documents of a collection from each other. Document space: when two documents are assigned very similar term sets, the corresponding points in the document configuration appear close together; when a high-frequency term without discrimination power is assigned, it increases the document space density.
  • Slide 15
  • Salton2-15 A Virtual Document Space. (Figure: the document space in its original state, after assignment of a good discriminator, and after assignment of a poor discriminator.)
  • Slide 16
  • Salton2-16 Good Term Assignment When a term is assigned to the documents of a collection, the few items to which the term is assigned will be distinguished from the rest of the collection. This should increase the average distance between the items in the collection and hence produce a document space less dense than before.
  • Slide 17
  • Salton2-17 Poor Term Assignment A high frequency term is assigned that does not discriminate between the items of a collection. Its assignment will render the documents more similar. This is reflected in an increase in document space density.
  • Slide 18
  • Salton2-18 Term Discrimination Value definition: dv_j = Q − Q_j, where Q and Q_j are the space densities before and after the assignment of term T_j. dv_j > 0: T_j is a good term; dv_j < 0: T_j is a poor term.
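A sketch of the definition, using one common operationalization of space density (average similarity of each document to the collection centroid; this particular density choice is an assumption, not stated on the slides):

```python
import math

def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def density(docs: list[dict[str, float]]) -> float:
    """Space density Q: average similarity of each document to the centroid."""
    centroid: dict[str, float] = {}
    for d in docs:
        for t, w in d.items():
            centroid[t] = centroid.get(t, 0.0) + w / len(docs)
    return sum(cosine(d, centroid) for d in docs) / len(docs)

def discrimination_value(docs: list[dict[str, float]], term: str) -> float:
    """dv_j = Q - Q_j: density without term T_j minus density with it.
    dv_j > 0 means T_j spreads the space apart (a good discriminator)."""
    without = [{t: w for t, w in d.items() if t != term} for d in docs]
    return density(without) - density(docs)
```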
  • Salton2-19 Document Frequency. Low frequency: dv_j = 0. Medium frequency: dv_j > 0. High frequency: dv_j < 0.
  • Salton2-41 Term-Phrase Formation. Term phrase: a sequence of related text words that carries a more specific meaning than the single terms, e.g., "computer science" vs. "computer". Phrase formation targets the high-frequency terms (dv_j < 0) in the frequency spectrum above.
  • Salton2-52 Thesaurus-Group Generation. Thesaurus transformation broadens index terms whose scope is too narrow to be useful in retrieval: a thesaurus must assemble groups of related specific terms under more general, higher-level class indicators. It targets the low-frequency terms (dv_j = 0) in the frequency spectrum above.
  • Salton2-57 Word Stemming. effectiveness --> effective --> effect; picnicking --> picnic; but king -/-> k (suffix removal must not over-strip).
  • Slide 58
  • Salton2-58 Some Morphological Rules Restore a silent e after suffix removal from certain words to produce hope from hoping rather than hop. Delete certain doubled consonants after suffix removal, so as to generate hop from hopping rather than hopp. Use a final y for an i in forms such as easier, so as to generate easy instead of easi.
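A simplified sketch of these three repair rules applied after a suffix has been stripped. The condition used for restoring the silent e (a short consonant-vowel-consonant stem) is my assumption; real stemmers such as Porter's use fuller condition lists:

```python
def repair_stem(stem: str, suffix: str) -> str:
    """Apply the slide's three morphological repair rules to a stripped stem."""
    # Rule 3: restore final y for i (easier -> easi -> easy).
    if stem.endswith("i"):
        return stem[:-1] + "y"
    # Rule 2: undo doubled consonants (hopping -> hopp -> hop).
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
        return stem[:-1]
    # Rule 1: restore a silent e (hoping -> hop -> hope) for short stems
    # ending consonant-vowel-consonant (condition is an assumption).
    if (suffix == "ing" and len(stem) == 3
            and stem[-1] not in "aeiou" and stem[-2] in "aeiou"):
        return stem + "e"
    return stem
```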
  • Slide 59
  • Salton2-59 The Indexing Prescription (2) Identify individual text words. Use stop list to delete common function words. Use automatic suffix stripping to produce word stems. Compute term-discrimination value for all word stems. Use thesaurus class replacement for all low-frequency terms with discrimination values near zero. Use phrase-formation process for all high-frequency terms with negative discrimination values. Compute weighting factors for complex indexing units. Assign to each document single term weights, term phrases, and thesaurus classes with weights.
  • Slide 60
  • Salton2-60 Query vs. Document Differences. Query texts are short; fewer terms are assigned to queries; the occurrence frequency of a query term rarely exceeds 1. Q = (w_q1, w_q2, ..., w_qt), where w_qj is an inverse-document-frequency weight. D_i = (d_i1, d_i2, ..., d_it), where d_ij is a term-frequency × inverse-document-frequency weight.
  • Slide 61
  • Salton2-61 Query vs. Document When non-normalized document vectors are used, longer documents with more assigned terms have a greater chance of matching particular query terms than shorter document vectors do; length normalization of the query-document similarity compensates for this.
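The slide's normalization formulas did not survive transcription; a sketch of the standard cosine-normalized match, which I assume is what was shown:

```python
import math

def cosine_similarity(query: dict[str, float], doc: dict[str, float]) -> float:
    """Length-normalized similarity between Q and D_i, so long documents
    with many assigned terms do not dominate purely by size."""
    dot = sum(w * doc.get(t, 0.0) for t, w in query.items())
    norm_q = math.sqrt(sum(w * w for w in query.values()))
    norm_d = math.sqrt(sum(w * w for w in doc.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0
```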
  • Slide 62
  • Salton2-62 Relevance Feedback Terms present in previously retrieved documents that have been identified as relevant to the user's query are added to the original formulations. The weights of the original query terms are altered by replacing the inverse document frequency portion of the weights with term-relevance weights obtained by using the occurrence characteristics of the terms in the previously retrieved relevant and nonrelevant documents of the collection.
  • Slide 63
  • Salton2-63 Relevance Feedback. Q = (w_q1, w_q2, ..., w_qt), D_i = (d_i1, d_i2, ..., d_it). The new query may take the form Q' = (w_q1, w_q2, ..., w_qt, w_qt+1, w_qt+2, ..., w_qt+m). The weights of the newly added terms T_t+1 to T_t+m may consist of a combined term-frequency and term-relevance weight.
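A sketch of the expansion step, adding the m highest-weighted terms from the retrieved relevant documents to the query. The simple averaging used to weight the new terms is my choice, not the slides' exact term-relevance formula:

```python
def expand_query(query: dict[str, float],
                 relevant_docs: list[dict[str, float]],
                 m: int = 10) -> dict[str, float]:
    """Add terms T_{t+1}..T_{t+m} from relevant documents to the query."""
    new_q = dict(query)
    pooled: dict[str, float] = {}          # average weight across relevant docs
    for doc in relevant_docs:
        for term, w in doc.items():
            pooled[term] = pooled.get(term, 0.0) + w / len(relevant_docs)
    candidates = sorted((t for t in pooled if t not in query),
                        key=lambda t: pooled[t], reverse=True)
    for term in candidates[:m]:
        new_q[term] = pooled[term]
    return new_q
```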
  • Salton2-71 Experiment Design. CLARIT commercial retrieval system: {original document set} --> CLARIT NP Extractor --> {Raw Noun Phrases} --> Statistical NP Parser, Phrase Extractor --> {Indexing Term Set} --> CLARIT Retrieval Engine.
  • Slide 72
  • Salton2-72 Different Indexing Units. Example: [[[heavy construction] industry] group] (WSJ90). Single words: heavy, construction, industry, group. Head-modifier pairs: heavy construction, construction industry, industry group. Full noun phrase: heavy construction industry group.
  • Slide 73
  • Salton2-73 Different Indexing Units (Continued). WD-SET: single words only (no phrases; baseline). WD-HM-SET: single words + head-modifier pairs. WD-NP-SET: single words + full NPs. WD-HM-NP-SET: single words + head-modifier pairs + full NPs.
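A sketch deriving the three kinds of indexing units from the bracketed WSJ90 example above, assuming each bracket pairs a modifier subtree with a head and that the head word is the rightmost word (a simplification for English NPs):

```python
# [[[heavy construction] industry] group] as a left-branching nested list:
NP = [[["heavy", "construction"], "industry"], "group"]

def words(np):
    """All words of a subtree, left to right (also the full-NP unit)."""
    return [np] if isinstance(np, str) else [w for part in np for w in words(part)]

def head(np):
    """Head word of a subtree: its rightmost word."""
    return words(np)[-1]

def hm_pairs(np):
    """One head-modifier pair per bracket: (modifier's head, phrase head)."""
    if isinstance(np, str):
        return []
    modifier, h = np
    return hm_pairs(modifier) + [(head(modifier), head(h))]

print(words(NP))             # ['heavy', 'construction', 'industry', 'group']
print(hm_pairs(NP))          # [('heavy', 'construction'), ('construction', 'industry'), ('industry', 'group')]
print(" ".join(words(NP)))   # 'heavy construction industry group'
```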
  • Slide 74
  • Salton2-74 Result Analysis. Collection: Tipster Disk 2 (250 MB). Queries: TREC-5 ad hoc topics (251-300). Relevance feedback: top 10 documents returned from the initial retrieval. Evaluation: total number of relevant documents retrieved; highest level of precision over all points of recall; average precision.
  • Slide 75
  • Salton2-75 Effects of phrases with feedback on TREC-5. (Results figure.)
  • Slide 76
  • Salton2-76 Summary. When only one kind of phrase is used to supplement the single words, each can lead to a great improvement in precision. When we combine the two kinds of phrases, the effect is a greater improvement in recall than in precision. How to combine and weight different phrases effectively becomes an important issue.
  • Slide 77
  • Salton2-77 A Corpus-Based Statistical Approach to Automatic Book Indexing Jyun-Sheng Chang, Tsung-Yih Tseng, Ying Cheng, Huey-Chyun Chen, Shun-Der Cheng, Sur-Jin Ker, and John S. Liu (ANLP92, pp. 147-151)
  • Slide 78
  • Salton2-78 Generating Indices Word Segmentation Part-of-speech tagging Finding noun phrases
  • Slide 79
  • Salton2-79 Example of Problem Description: segmentation, tagging, noun phrase finding. (The Chinese example sentences were not preserved in this transcript; their tag sequences were: P/Q/CL/LOC/CTM/NC/NC; P/D/Q/CL/NC/LOC/LOC/LOC/V/CTM/NC; NP/ADV/V/NC/CTM/NC; P/NC/CTM/NC/CTM/NC/NC.)
  • Slide 80
  • Salton2-80 Word Segmentation. Given a Chinese sentence, segment the sentence into words. (The Chinese example was not preserved in this transcript.)
  • Slide 81
  • Salton2-81 Segmentation as a Constraint Satisfaction Problem. Given a sequence of Chinese characters C1, C2, ..., Cn, assign break/continue to each place Xi between two adjacent characters Ci and Ci+1 (break: >, continue: =). (The example characters were not preserved; their label sequence was > = = > > = > = > > > =.)
  • Slide 82
  • Salton2-82 Detailed Specification. For each sequence of characters Ci, ..., Cj which is a Chinese word in the dictionary or a surname-name: if j = i, put (>, >) in Ki-1,i. If j > i, put (>, =) in Ki-1,i, (=, =) in Ki,i+1 through Kj-2,j-1, and (=, >) in Kj-1,j.
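A sketch of this constraint-table construction, collecting candidate (left, right) label pairs per character position from dictionary matches; the data layout is my choice for illustration:

```python
def constraint_pairs(n: int, matches: list[tuple[int, int]]) -> dict[int, set]:
    """Candidate break/continue label pairs per character position.

    n: sentence length; matches: dictionary words as (i, j) spans over
    C_i..C_j (1-based, inclusive). '>' = break, '=' = continue.
    K[p] holds candidate pairs (X_{p-1}, X_p) around character C_p.
    """
    K: dict[int, set] = {p: set() for p in range(1, n + 1)}
    for i, j in matches:
        if i == j:                    # single-character word: break on both sides
            K[i].add((">", ">"))
        else:                         # multi-character word C_i..C_j
            K[i].add((">", "="))      # word start: break before, continue after C_i
            for p in range(i + 1, j):
                K[p].add(("=", "="))  # interior characters: continue on both sides
            K[j].add(("=", ">"))      # word end: continue before, break after C_j
    return K
```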
  • Slide 83
  • Salton2-83 Example. Character positions 0-13; the words found by dictionary lookup (the Chinese characters were not preserved in this transcript) generate the following candidate label pairs per position:
    0: (>,>)
    1: (>,=)
    2: (=,>), (=,=)
    3: (=,>)
    4: (>,>)
    5: (>,>)
    6: (>,>), (>,=)
    7: (>,>), (=,>), (>,=)
    8: (>,>), (=,>), (>,=)
    9: (>,>), (=,>), (>,=)
    10: (>,>), (=,>), (>,=)
    11: (>,>)
    12: (>,>), (>,=)
    13: (=,>)
  • Slide 84
  • Salton2-84 Differences from English IR. Data analysis issues. media: syllable structure in speech data. code & character: GB and BIG5 code conversion; individual characters carry semantics of their own. word: word stemming and spelling are simple, but word segmentation and proper-noun identification are hard. (Dr. L.F. Chien, 1996)
  • Slide 85
  • Salton2-85 Differences from English IR (Continued). Interface issues. input: eager demand for speech and OCR input. query: need searching for approximate terms; rigid information in natural-language queries (NLQ); hard to find proper nouns in NLQ.
  • Slide 86
  • Salton2-86 Differences from English IR (Continued) Indexing and searching issues index: hard to use word-level and complete index like inverted file workable to use character-level and filtering index like signature (Chien, 1995) searching: need multiple-stage searching, need best match in term level
  • Slide 87
  • Salton2-87 Segmentation Problem Segmentation is a serious problem in processing Chinese sentences (Hsin-Hsi Chen, 1996)
  • Slide 88
  • Salton2-88 Strategies. Dictionary lookup supplemented by other special strategies: the-longest-word-first; the number of words; ...
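A sketch of the longest-word-first strategy named above: greedy dictionary lookup that prefers the longest match at each position, falling back to a single character (the maximum word length and fallback behavior are assumptions for illustration):

```python
def longest_match_segment(text: str, dictionary: set[str],
                          max_len: int = 4) -> list[str]:
    """Greedy longest-word-first segmentation over a character string."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words
```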