tokenization - simon fraser university slides/l07 - tokenization.… · j. pei: information...

Tokenization

J. Pei: Information Retrieval and Web Search -- Tokenization

Where Are Terms From?

•  How can we derive terms from documents to be indexed?
    –  Collect the documents to be indexed
    –  Tokenize the text
    –  Apply linguistic preprocessing to the tokens
•  Language-dependent
    –  Use heuristic methods, user selection, metadata, or machine learning methods to determine the language of the document
    –  Some languages (e.g., Arabic) may need special sequencing preprocessing


Choosing a Proper Document Unit

•  Many possible choices
    –  Each file in a folder as a document
    –  In an mbox-format UNIX email file, each email within the large file is treated as a document
    –  Within an email, each attachment may be treated as a document
•  Why does indexing granularity matter?
    –  A tradeoff between precision and recall
    –  Large granularity often leads to low precision: searching for “Kung Fu Panda” may return a book containing “Kung Fu” at the beginning and “Panda” at the end
    –  Very small granularity often leads to low recall: searching for “Beijing Olympic” may miss the two sentences “I went to Beijing to join my friends. We watched the Olympic games together.”


Tokenization

•  Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation
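A minimal sketch of this chopping step, assuming a naive rule that keeps runs of letters and digits and throws everything else away. The function name `tokenize` is hypothetical, and a real tokenizer has to handle many more cases.

```python
import re


def tokenize(text):
    """Chop text into tokens, discarding punctuation and whitespace."""
    # \w+ matches maximal runs of letters, digits, and underscores.
    return re.findall(r"\w+", text)


tokens = tokenize("Friends, Romans, Countrymen, lend me your ears;")
print(tokens)
```

Commas and the semicolon are discarded because they never match `\w`. The same rule would also split “O’Neill” and “C++”, so it is only a starting point.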


Tokens, Types, and Terms

•  Text: “to sleep perchance to dream”
•  A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing
    –  Examples: “to”, “sleep”, “perchance”, “to”, “dream”
•  A type is the class of all tokens containing the same character sequence
    –  Examples: “to”, “sleep”, “perchance”, “dream”
•  A term is a (perhaps normalized) type that is included in the IR system’s dictionary
    –  Examples: “sleep”, “perchance”, “dream” (if “to” is omitted as a stop word)
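The distinction can be checked on the sample text with a few lines of Python. This is only a sketch: it assumes whitespace tokenization and a one-word stop list containing “to”.

```python
text = "to sleep perchance to dream"
stop_words = {"to"}  # assumption: "to" is dropped as a stop word

tokens = text.split()                               # every occurrence counts
types = sorted(set(tokens))                         # distinct character sequences
terms = [t for t in types if t not in stop_words]   # what enters the dictionary

print(len(tokens), len(types), len(terms))  # prints: 5 4 3
```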


Apostrophes

•  Used for possession and contractions
    –  Example: “Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing.”


Specific Tokens

•  C++, C#
•  B-52, B777
•  M*A*S*H
•  Email addresses ([email protected])
•  Web URLs (http://www.cs.sfu.ca)
•  IP addresses (142.32.48.231)
•  Phone numbers (778-782-3054)
•  City names (San Francisco, New York)
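One way to keep several of these tokens intact is to try specific patterns before the generic word pattern. The sketch below is illustrative only: the pattern set, the names `TOKEN_PATTERN` and `tokenize_special`, and the sample address `someone@example.org` are assumptions. Multi-word city names are not handled, since they need a gazetteer or phrase detection.

```python
import re

# Alternatives are tried left to right, so specific patterns come first.
TOKEN_PATTERN = re.compile(r"""
    https?://[^\s()]+               # web URLs
  | [\w.+-]+@[\w-]+(?:\.[\w-]+)+    # email addresses
  | \d{1,3}(?:\.\d{1,3}){3}         # IPv4 addresses
  | \d{3}-\d{3}-\d{4}               # phone numbers such as 778-782-3054
  | \w+(?:[*-]\w+)+                 # M*A*S*H, B-52
  | \w+(?:\+\+|\#)?                 # C++, C#, and ordinary words
""", re.VERBOSE)


def tokenize_special(text):
    """Return tokens, keeping the special forms above as single tokens."""
    return TOKEN_PATTERN.findall(text)


sample = ("I code in C++ and C# on 142.32.48.231; call 778-782-3054 "
          "or see http://www.cs.sfu.ca about M*A*S*H and the B-52.")
print(tokenize_special(sample))
```

The ordering is the design choice here: a generic `\w+` rule placed first would split the IP address and the phone number into separate numbers.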


Hyphens

•  Hyphenation is used in English for
    –  Splitting up vowels in words (co-education)
    –  Joining nouns as names (Hewlett-Packard)
    –  Showing word grouping (the hold-him-back-and-drag-him-away maneuver)
    –  Special usage (San Francisco-Los Angeles)
•  Splitting on white space may not always be desirable
    –  “New York University” should not be returned for the query “York University”
    –  “lowercase”, “lower-case”, and “lower case” are equivalent


Word Segmentation

•  In some languages (e.g., Chinese), text is written without any spaces between words
    –  Example: 信息检索和Web搜索是一门很有意思的课程。 (“Information retrieval and Web search is a very interesting course.”)
•  Word segmentation methods
    –  Use a large vocabulary and take the longest vocabulary match
    –  Machine learning sequence models (e.g., Markov models, conditional random fields)
    –  Character k-grams
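The longest-vocabulary-match method can be sketched as a greedy forward scan. The tiny vocabulary below is a hypothetical one built for the example sentence, and unknown characters fall back to single-character tokens, which is an assumption of this sketch.

```python
def max_match(text, vocabulary):
    """Greedy forward segmentation: always take the longest vocabulary match."""
    max_len = max(len(word) for word in vocabulary)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a word matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted as a fallback.
            if length == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Hypothetical vocabulary for illustration only.
vocabulary = {"信息", "检索", "信息检索", "和", "Web", "搜索", "是",
              "一门", "很", "有", "意思", "有意思", "的", "课程"}
print(max_match("信息检索和Web搜索是一门很有意思的课程。", vocabulary))
```

Because the scan is greedy, “信息检索” is preferred over “信息” followed by “检索”. On other inputs a greedy choice can be wrong, which is one motivation for the sequence models listed above.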


Stop Words

•  Some extremely common words that would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely
    –  Determined by collection frequency: the total number of times each term appears in the document collection
•  Using a stop list significantly reduces the number of postings that a system has to store

25 semantically nonselective words that are common in Reuters-RCV1
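A rough sketch of deriving a stop list from collection frequency and measuring its effect on postings. The three-document collection and the function names are hypothetical, and a posting is counted here as one (term, document) pair.

```python
from collections import Counter


def build_stop_list(documents, size):
    """Return the `size` terms with the highest collection frequency."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())  # total occurrences across the collection
    return [term for term, _ in counts.most_common(size)]


def count_postings(documents, stop_words):
    """Count (term, document) pairs that remain after removing stop words."""
    return sum(len(set(doc.lower().split()) - set(stop_words)) for doc in documents)


docs = ["the cat sat on the mat", "the dog and the cat", "a dog sat"]
stop_list = build_stop_list(docs, 1)
print(stop_list, count_postings(docs, []), count_postings(docs, stop_list))
```

In practice a frequency-ranked list is usually reviewed by hand, since some frequent words still carry meaning in the domain.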


Using Stop Words or Not?

•  Phrase searches may depend on stop words
    –  “President of the United States”
    –  “flights from Vancouver” vs. “flights to Vancouver”
    –  “To be or not to be”, “Let It Be”, “I don’t want to be”, …
•  Web search engines generally do not use stop lists
    –  Some specific techniques introduced later can reduce the cost due to stop words


Token Normalization

•  How can we know that “USA” matches “U.S.A.”?
•  Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens
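As a small illustration, the sketch below groups tokens into equivalence classes under one assumed rule, deleting periods. The rule and the function names are hypothetical; a real system would combine several such rules.

```python
from collections import defaultdict


def normalize(token):
    """Canonicalize a token by deleting periods (assumed rule for this sketch)."""
    return token.replace(".", "")


def equivalence_classes(tokens):
    """Map each normalized form to the set of surface forms that produce it."""
    classes = defaultdict(set)
    for token in tokens:
        classes[normalize(token)].add(token)
    return dict(classes)


print(equivalence_classes(["USA", "U.S.A.", "U.S.A", "Canada"]))
```

Query tokens go through the same `normalize` function, so a query for “U.S.A.” and a document containing “USA” meet at the same dictionary entry.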


Creating Equivalence Classes

•  Using rules that remove characters like hyphens
    –  Both “anti-discriminatory” and “antidiscriminatory” map to “antidiscriminatory”
•  Maintaining relations between unnormalized tokens
    –  Index unnormalized tokens and maintain a query expansion list of multiple vocabulary entries: when a query asks for “car”, search both “car” and “automobile”
    –  Expansion during index construction: index a document containing “car” under both “car” and “automobile”
    –  Expansion of query terms can be asymmetric
    –  A more space-costly but more flexible method
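Query-time expansion over an unnormalized index might look like the sketch below. The expansion table is hypothetical, and its asymmetry (“car” expands to “automobile” but not the reverse) is chosen only to show that the two directions need not match.

```python
# Hypothetical expansion list; each query term maps to the entries to search.
QUERY_EXPANSIONS = {
    "car": ["car", "automobile"],
    "automobile": ["automobile"],  # asymmetric on purpose
}


def expand_query(term):
    """Return the vocabulary entries to look up for a query term."""
    return QUERY_EXPANSIONS.get(term, [term])


def search(term, index):
    """Union the postings of every expansion of the query term."""
    results = set()
    for variant in expand_query(term):
        results |= index.get(variant, set())
    return results


# Toy inverted index: term -> set of document ids.
index = {"car": {1, 2}, "automobile": {2, 3}}
print(search("car", index), search("automobile", index))
```

The index stays small because each document is stored only under the tokens it contains. The cost moves to query time, where several postings lists are merged.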


Example


Normalization Techniques

•  Accents and diacritics
    –  Normalizing tokens to remove diacritics
•  Capitalization/case-folding
    –  Reducing all letters to lower case
        •  May cause problems for names such as Bush, Black, Fed, …
    –  Use some heuristics to make only some tokens lowercase, e.g., convert the first word in a sentence and all words in a title
    –  Truecasing: use a machine learning sequence model to make the decision
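Both techniques can be sketched with the Python standard library. Removing diacritics by decomposing to NFD and dropping combining marks is one common approach, assumed here; it does not cover letters that have no decomposition.

```python
import unicodedata


def strip_diacritics(token):
    """Remove accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def case_fold(token):
    """Reduce all letters to lower case."""
    return token.lower()


print(strip_diacritics("naïve"), strip_diacritics("résumé"))
print(case_fold("Bush"), case_fold("Fed"))
```

The second line of output shows the problem noted above: after folding, the names “Bush” and “Fed” cannot be told apart from the common words “bush” and “fed”.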


Stemming and Lemmatization

•  How can we know “organize”, “organizes”, and “organizing” should map to the same word?
•  Stemming and lemmatization: reduce inflectional forms and sometimes derivationally related forms of a word to a common base form
    –  am, are, is → be
    –  car, cars, car’s, cars’ → car
    –  “the boy’s cars are different colors” → “the boy car be different color”


Stemming

•  Algorithmic: a crude heuristic process that chops off the ends of words in the hope of being correct most of the time
    –  Often removes derivational affixes
•  Porter’s algorithm
    –  Use the rule “(m>1) EMENT → ” (delete the suffix when the measure m of the remaining stem is greater than 1) to map “replacement” to “replac”
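The EMENT rule can be sketched in isolation. The measure m counts vowel-consonant sequences in a stem written as [C](VC)^m[V], following Porter's definition, where “y” counts as a vowel when it follows a consonant. The function names are hypothetical, and this covers one rule only, not the full algorithm.

```python
def is_consonant(word, i):
    """Porter's definition: a, e, i, o, u are vowels; y is a vowel after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count the VC sequences (m) in a stem of the form [C](VC)^m[V]."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and prev_vowel:
            m += 1  # a run of vowels has just been closed by a consonant
        prev_vowel = not consonant
    return m


def strip_ement(word):
    """Apply the single rule (m>1) EMENT -> (empty)."""
    suffix = "ement"
    if word.endswith(suffix):
        stem = word[:-len(suffix)]
        if measure(stem) > 1:
            return stem
    return word


print(strip_ement("replacement"), strip_ement("cement"))
```

“replac” has m = 2, so the suffix is removed. For “cement” the remaining stem “c” has m = 0, so the word is left alone.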


Comparison of Stemmers


Lemmatization

•  Dictionary-based stemming
•  Use a vocabulary and morphological analysis of words to remove inflectional endings only and return the base or dictionary form of a word (lemma)
•  “saw” → “see” or “saw”, depending on whether the token is used as a verb or a noun
•  Can bring very modest benefit for retrieval in English: improves recall but may hurt precision


Krovetz Stemmer – A Hybrid Method

•  A hybrid approach
•  Constantly uses a dictionary to check whether a word is valid
    –  If a word is not found, check the word against a list of common inflectional and derivational suffixes, modify the word, and check the dictionary again
•  Uses manually generated exception entries to record special stemming rules
•  Low false positive rate, but tends to have a high false negative rate
•  Produces stems that, in most cases, are full words


Comparison


Phrases

•  Phrases are important in queries
    –  For query “black sea”, a document containing the sentence “the sea turned black” may not be a good match
•  Phrases are often subtle
    –  For query “fishing supplies”, should documents containing “fish”, “fishing”, and “supplies” count?
•  How should phrases be identified in tokenizing and stemming?
    –  N-gram method: a phrase is any sequence of n words
    –  Many search engines index all n-grams of 2 ≤ n ≤ 5
    –  In a document of 1,000 words, there are 3,990 n-grams for 2 ≤ n ≤ 5 (999 + 998 + 997 + 996)
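The count follows from the fact that a document of N words contains N − n + 1 word n-grams for each n. A short sketch, with hypothetical function names, that checks the arithmetic and lists the n-grams of a small example:

```python
def count_ngrams(num_words, n_min, n_max):
    """Number of word n-grams in a document, summed over n_min <= n <= n_max."""
    return sum(num_words - n + 1
               for n in range(n_min, n_max + 1)
               if n <= num_words)


def ngrams(words, n):
    """All contiguous sequences of n words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


print(count_ngrams(1000, 2, 5))  # prints: 3990
print(ngrams("the sea turned black".split(), 2))
```

The total grows roughly linearly in document length for each n, so indexing all n-grams for 2 ≤ n ≤ 5 costs about four extra entries per word position.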


Part-of-Speech (POS) Tagger

•  A POS tagger marks the words in a text with labels corresponding to the part of speech of the word in that context
    –  Based on statistical or rule-based approaches
    –  Trained on large, manually labeled corpora
•  Typical tags
    –  NN (singular noun), NNS (plural noun), VB (verb), VBD (verb, past tense), VBN (verb, past participle), IN (preposition), JJ (adjective), CC (conjunction, e.g., “and”, “or”), PRP (pronoun), MD (modal auxiliary, e.g., “can”, “will”)
•  Noun phrases: sequences of nouns, or adjectives followed by nouns
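Given text that has already been tagged, the noun-phrase rule can be sketched as a scan for runs of adjectives and nouns that end in a noun. The tagged input is hand-written for illustration, and the `min_length=2` cutoff and the simplification of accepting any mix of JJ/NN/NNS within a run are assumptions of this sketch.

```python
NOUN_TAGS = {"NN", "NNS"}
ADJ_TAGS = {"JJ"}


def noun_phrases(tagged, min_length=2):
    """Extract runs of adjectives and nouns that end in a noun."""
    phrases = []
    run = []
    # A sentinel with tag None flushes the final run.
    for word, tag in list(tagged) + [("", None)]:
        if tag in NOUN_TAGS or tag in ADJ_TAGS:
            run.append((word, tag))
            continue
        # Drop trailing adjectives so the phrase ends in a noun.
        while run and run[-1][1] in ADJ_TAGS:
            run.pop()
        if len(run) >= min_length:
            phrases.append(" ".join(w for w, _ in run))
        run = []
    return phrases


# Hand-tagged example (not the output of a real tagger).
tagged = [("the", "DT"), ("black", "JJ"), ("sea", "NN"), ("has", "VBZ"),
          ("fishing", "NN"), ("supplies", "NNS"), ("and", "CC"), ("cheap", "JJ")]
print(noun_phrases(tagged))
```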


POS Tagger Example

Classroom discussion: What is the major drawback of using a POS tagger in web search?


Summary

•  It is an important task to extract tokens from documents
    –  Choosing document units
    –  Tokenization
    –  Stop words and using stop words
    –  Token normalization
    –  Stemming and lemmatization
    –  Processing phrases


To-do List

•  Read Section 4.3 in the textbook
•  Try out the Porter Stemmer
