text features - university of washington...features •key to machine learning is having good...
TRANSCRIPT
![Page 1: Text Features - University of Washington...Features •Key to machine learning is having good features •In industrial data mining, large effort devoted to constructing appropriate](https://reader034.vdocuments.mx/reader034/viewer/2022042912/5f484fe6c1f81639896ebfb3/html5/thumbnails/1.jpg)
Text Features
![Page 2: Text Features - University of Washington...Features •Key to machine learning is having good features •In industrial data mining, large effort devoted to constructing appropriate](https://reader034.vdocuments.mx/reader034/viewer/2022042912/5f484fe6c1f81639896ebfb3/html5/thumbnails/2.jpg)
Features
• Key to machine learning is having good features
• In industrial data mining, large effort devoted to constructing appropriate features
2
![Page 3: Text Features - University of Washington...Features •Key to machine learning is having good features •In industrial data mining, large effort devoted to constructing appropriate](https://reader034.vdocuments.mx/reader034/viewer/2022042912/5f484fe6c1f81639896ebfb3/html5/thumbnails/3.jpg)
Issues in document representation
• Cooper’s vs. Cooper vs. Coopers.
• Full-text vs. full text vs. {full, text} vs. fulltext.
• résumé vs. resume.
Cooper’s concordance of Wordsworth was published in
1911. The applications of full-text retrieval are legion:
they include résumé scanning, litigation support and
searching published journals on-line.
slide from Raghavan, Schütze, Larson
![Page 4: Text Features - University of Washington...Features •Key to machine learning is having good features •In industrial data mining, large effort devoted to constructing appropriate](https://reader034.vdocuments.mx/reader034/viewer/2022042912/5f484fe6c1f81639896ebfb3/html5/thumbnails/4.jpg)
Punctuation
• Ne’er: use language-specific, handcrafted “locale” to normalize.
• State-of-the-art: break up hyphenated sequence.
• U.S.A. vs. USA - use locale.
• a.out
slide from Raghavan, Schütze, Larson
![Page 5: Text Features - University of Washington...Features •Key to machine learning is having good features •In industrial data mining, large effort devoted to constructing appropriate](https://reader034.vdocuments.mx/reader034/viewer/2022042912/5f484fe6c1f81639896ebfb3/html5/thumbnails/5.jpg)
Numbers
• 3/12/91
• Mar. 12, 1991
• 55 B.C.
• B-52
• 100.2.86.144
– Generally, don’t index as text
– Creation dates for docs
slide from Raghavan, Schütze, Larson
![Page 6: Text Features - University of Washington...Features •Key to machine learning is having good features •In industrial data mining, large effort devoted to constructing appropriate](https://reader034.vdocuments.mx/reader034/viewer/2022042912/5f484fe6c1f81639896ebfb3/html5/thumbnails/6.jpg)
Case folding
• Reduce all letters to lower case
• Exception: upper case in mid-sentence
– e.g., General Motors
– Fed vs. fed
– SAIL vs. sail
slide from Raghavan, Schütze, Larson
![Page 7: Text Features - University of Washington...Features •Key to machine learning is having good features •In industrial data mining, large effort devoted to constructing appropriate](https://reader034.vdocuments.mx/reader034/viewer/2022042912/5f484fe6c1f81639896ebfb3/html5/thumbnails/7.jpg)
Thesauri and Soundex
• Handle synonyms and homonyms
– Hand-constructed equivalence classes
• e.g., car = automobile
• your ≠ you’re
• Index such equivalences?
• Or expand query?
slide from Raghavan, Schütze, Larson
![Page 8: Text Features - University of Washington...Features •Key to machine learning is having good features •In industrial data mining, large effort devoted to constructing appropriate](https://reader034.vdocuments.mx/reader034/viewer/2022042912/5f484fe6c1f81639896ebfb3/html5/thumbnails/8.jpg)
Spell Correction
• Look for all words within (say) edit distance 3 (Insert/Delete/Replace) at query time
– e.g., Alanis Morisette
• Spell correction is expensive and slows the query (up to a factor of 100)
– Invoke only when index returns zero matches?
– What if docs contain mis-spellings?
slide from Raghavan, Schütze, Larson
![Page 9: Text Features - University of Washington...Features •Key to machine learning is having good features •In industrial data mining, large effort devoted to constructing appropriate](https://reader034.vdocuments.mx/reader034/viewer/2022042912/5f484fe6c1f81639896ebfb3/html5/thumbnails/9.jpg)
Lemmatization
• Reduce inflectional/variant forms to base form – am, are, is be
– car, cars, car's, cars' car
the boy's cars are different colors
the boy car be different color
![Page 10: Text Features - University of Washington...Features •Key to machine learning is having good features •In industrial data mining, large effort devoted to constructing appropriate](https://reader034.vdocuments.mx/reader034/viewer/2022042912/5f484fe6c1f81639896ebfb3/html5/thumbnails/10.jpg)
Stemming
• Reduce terms to their “roots” before indexing
– language dependent
– e.g., automate(s), automatic, automation all reduced to automat.
for example compressed
and compression are both
accepted as equivalent to
compress.
for exampl compres and
compres are both accept
as equival to compres.
slide from Raghavan, Schütze, Larson
![Page 11: Text Features - University of Washington...Features •Key to machine learning is having good features •In industrial data mining, large effort devoted to constructing appropriate](https://reader034.vdocuments.mx/reader034/viewer/2022042912/5f484fe6c1f81639896ebfb3/html5/thumbnails/11.jpg)
Porter’s algorithm • Common algorithm for stemming English
• Conventions + 5 phases of reductions
– phases applied sequentially
– each phase consists of a set of commands
– sample convention: Of the rules in a compound command, select the one that applies to the longest suffix.
• Porter’s stemmer available: http//www.sims.berkeley.edu/~hearst/irbook/porter.html
slide from Raghavan, Schütze, Larson
![Page 12: Text Features - University of Washington...Features •Key to machine learning is having good features •In industrial data mining, large effort devoted to constructing appropriate](https://reader034.vdocuments.mx/reader034/viewer/2022042912/5f484fe6c1f81639896ebfb3/html5/thumbnails/12.jpg)
Typical rules in Porter
• sses ss
• ational ate
• tional tion
slide from Raghavan, Schütze, Larson
![Page 13: Text Features - University of Washington...Features •Key to machine learning is having good features •In industrial data mining, large effort devoted to constructing appropriate](https://reader034.vdocuments.mx/reader034/viewer/2022042912/5f484fe6c1f81639896ebfb3/html5/thumbnails/13.jpg)
Challenges
• Sandy
• Sanded
• Sander
Sand ???
slide from Raghavan, Schütze, Larson
![Page 14: Text Features - University of Washington...Features •Key to machine learning is having good features •In industrial data mining, large effort devoted to constructing appropriate](https://reader034.vdocuments.mx/reader034/viewer/2022042912/5f484fe6c1f81639896ebfb3/html5/thumbnails/14.jpg)
Properties of Text
• Word frequencies - skewed distribution
• `The’ and `of’ account for 10% of all words
• Six most common words account for 40%
Zipf’s Law:
Rank * probability = c
Eg, c = 0.1
From [Croft, Metzler & Strohman 2010] 14
![Page 15: Text Features - University of Washington...Features •Key to machine learning is having good features •In industrial data mining, large effort devoted to constructing appropriate](https://reader034.vdocuments.mx/reader034/viewer/2022042912/5f484fe6c1f81639896ebfb3/html5/thumbnails/15.jpg)
Associate Press Corpus `AP89’
From [Croft, Metzler & Strohman 2010] 15
![Page 16: Text Features - University of Washington...Features •Key to machine learning is having good features •In industrial data mining, large effort devoted to constructing appropriate](https://reader034.vdocuments.mx/reader034/viewer/2022042912/5f484fe6c1f81639896ebfb3/html5/thumbnails/16.jpg)
Middle Ground
• Very common words bad features
• Language-based stop list: words that bear little meaning
20-500 words http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words
• Subject-dependent stop lists
• Very rare words also bad features Drop words appearing less than k times / corpus
16
![Page 17: Text Features - University of Washington...Features •Key to machine learning is having good features •In industrial data mining, large effort devoted to constructing appropriate](https://reader034.vdocuments.mx/reader034/viewer/2022042912/5f484fe6c1f81639896ebfb3/html5/thumbnails/17.jpg)
Beyond Words
• Look at capitalization (may indicated a proper noun)
• Look for commonly occurring sequences
• E.g. New York, New York City
• Limit to 2-3 consecutive words
• Keep all that meet minimum threshold (e.g. occur at least 5 or 10 times in corpus)
17