tokenization - simon fraser university slides/l07 - tokenization.… · j. pei: information...


Post on 12-Sep-2018


Tokenization


J. Pei: Information Retrieval and Web Search -- Tokenization 2

Where Are Terms From?

•  How can we derive terms from documents to be indexed?
   –  Collect the documents to be indexed
   –  Tokenize the text
   –  Apply linguistic preprocessing to the tokens
•  Language-dependent
   –  Use heuristic methods, user selection, metadata, or machine learning methods to determine the language of the document
   –  Some languages (e.g., Arabic) may need special sequencing preprocessing


Choosing a Proper Document Unit

•  Many possible choices
   –  Each file in a folder as a document
   –  In an mbox-format UNIX email file, each email within the large file is treated as a document
   –  Within an email, each attachment may be treated as a document
•  Why does indexing granularity matter?
   –  A tradeoff between precision and recall
   –  Big granularity often leads to low precision: searching for “Kung Fu Panda” may return a book containing “Kung Fu” at the beginning and “Panda” at the end
   –  Very small granularity often leads to low recall: searching for “Beijing Olympic” may miss the two sentences “I went to Beijing to join my friends. We watched the Olympic games together.”
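The recall side of this tradeoff can be sketched in a few lines of Python. This is an illustrative sketch, not a real indexer: the helper names `units` and `matches_all`, the naive sentence splitter, and the AND-style matching are all assumptions made for the example.

```python
import re

def units(text, granularity):
    """Split text into indexing units: the whole document or single sentences."""
    if granularity == "document":
        return [text]
    # Naive sentence split after '.', '!' or '?' (an assumption for this sketch).
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def matches_all(unit, query_terms):
    """True if every query term occurs as a word in the unit (case-insensitive)."""
    words = set(re.findall(r"\w+", unit.lower()))
    return all(term.lower() in words for term in query_terms)

text = "I went to Beijing to join my friends. We watched the Olympic games together."
query = ["Beijing", "Olympic"]

doc_hits = [u for u in units(text, "document") if matches_all(u, query)]
sent_hits = [u for u in units(text, "sentence") if matches_all(u, query)]
print(len(doc_hits), len(sent_hits))
```

With the whole text as one unit the query matches. With sentence-sized units neither sentence contains both terms, so the match is lost.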


Tokenization

•  Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation
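A minimal sketch of this definition in Python, assuming that tokens are maximal runs of ASCII letters and digits and that every other character is thrown away. The function name `tokenize` and the sample sentence are illustrative and not taken from the slides.

```python
import re

def tokenize(text):
    # Keep runs of letters and digits; spaces and punctuation are dropped.
    return re.findall(r"[A-Za-z0-9]+", text)

tokens = tokenize("Friends, Romans, Countrymen, lend me your ears;")
print(tokens)
```

Real tokenizers need many more rules than this.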


Tokens, Types, and Terms

•  Text: “to sleep perchance to dream”
•  A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing
   –  Examples: “to”, “sleep”, “perchance”, “to”, “dream”
•  A type is the class of all tokens containing the same character sequence
   –  Examples: “to”, “sleep”, “perchance”, “dream”
•  A term is a (perhaps normalized) type that is included in the IR system’s dictionary
   –  Examples: “sleep”, “perchance”, “dream” (here “to” is left out, e.g., as a stop word)
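The three notions can be told apart on the sample text with a short Python sketch. It assumes whitespace tokenization and a one-word stop list containing “to”. Both are assumptions chosen so that the result agrees with the examples above.

```python
text = "to sleep perchance to dream"

tokens = text.split()            # every occurrence counts
types = sorted(set(tokens))      # distinct character sequences
stop_words = {"to"}              # assumed stop list for this sketch
terms = [t for t in types if t not in stop_words]

print(len(tokens), len(types), terms)
```

The text has five tokens but only four types, because “to” occurs twice.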


Apostrophes

•  Used for possession and contractions
   –  Example: “Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing.”
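A small Python sketch shows how much the choice of rule matters for this sentence. Both regular expressions are illustrative assumptions, and the sentence is written with straight apostrophes to keep the patterns simple.

```python
import re

sentence = "Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing."

# Strategy 1: split on every non-letter, so apostrophes break words apart.
split_on_non_letters = re.findall(r"[A-Za-z]+", sentence)

# Strategy 2: keep an apostrophe when it sits between two letters.
keep_internal = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", sentence)

print(split_on_non_letters)
print(keep_internal)
```

The first strategy yields fragments such as “O”, “Neill”, “aren”, and “t”. The second keeps “O'Neill”, “Chile's”, and “aren't” whole. Neither choice is right for every query.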


Specific Tokens

•  C++, C#
•  B-52, B777
•  M*A*S*H
•  Email addresses ([email protected])
•  Web URLs (http://www.cs.sfu.ca)
•  IP addresses (142.32.48.231)
•  Phone numbers (778-782-3054)
•  City names (San Francisco, New York)
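One common way to protect such tokens is to try specific patterns before a generic word pattern. The Python sketch below assumes a handful of hand-written patterns. They are illustrative and far from complete, and the name `SPECIAL` is made up for this example.

```python
import re

# More specific patterns come first; the generic word pattern is the fallback.
SPECIAL = re.compile(r"""
    https?://\S+                 # web URLs
  | \d{1,3}(?:\.\d{1,3}){3}      # IPv4-like addresses
  | \d{3}-\d{3}-\d{4}            # phone numbers such as 778-782-3054
  | [A-Za-z]\+\+ | [A-Za-z]\#    # C++, C#
  | [A-Za-z]+-\d+                # B-52
  | \w+                          # any other word
""", re.VERBOSE)

text = "Visit http://www.cs.sfu.ca or call 778-782-3054 about C++ and B-52 on 142.32.48.231"
tokens = SPECIAL.findall(text)
print(tokens)
```

Multi-word names such as “San Francisco” cannot be caught by patterns like these and usually need a dictionary or phrase detection.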


Hyphens

•  Hyphenation is used in English for
   –  Splitting up vowels in words (co-education)
   –  Joining nouns as names (Hewlett-Packard)
   –  Showing word grouping (the hold-him-back-and-drag-him-away maneuver)
   –  Special usage (San Francisco-Los Angeles)
•  Splitting on white space may not always be desirable
   –  “New York University” should not be returned for the query “York University”
   –  “lowercase”, “lower-case”, and “lower case” are equivalent
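One simple way to treat the three spellings as equivalent is to expand a hyphenated query term into all three forms. The Python sketch below assumes this query-side expansion, and the function name `hyphen_variants` is hypothetical.

```python
def hyphen_variants(term):
    """Return the hyphenated, joined, and space-separated forms of a term."""
    parts = term.split("-")
    return {term, "".join(parts), " ".join(parts)}

variants = hyphen_variants("lower-case")
print(sorted(variants))
```

A term without a hyphen yields only itself, so the expansion helps only when the user types the hyphenated form.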


Word Segmentation

•  In some languages (e.g., Chinese), text is written without any spaces between words
   –  Example: 信息检索和Web搜索是一门很有意思的课程。
•  Word segmentation methods
   –  Use a large vocabulary and take the longest vocabulary match
   –  Machine learning sequence models (e.g., Markov models, conditional random fields)
   –  Character k-grams
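The longest-match method can be sketched as a greedy left-to-right scan. The toy vocabulary below is an assumption made for this example. A real system would use a large dictionary, and greedy matching can still pick the wrong split.

```python
def longest_match_segment(text, vocab):
    """Greedy forward segmentation: at each position take the longest vocabulary word."""
    max_len = max(len(word) for word in vocab)
    out, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing in the vocabulary matches.
            if size == 1 or piece in vocab:
                out.append(piece)
                i += size
                break
    return out

# Toy vocabulary (assumed for illustration only).
vocab = {"信息", "检索", "信息检索", "和", "Web", "搜索", "是", "一门", "很", "有意思", "的", "课程"}
segments = longest_match_segment("信息检索和Web搜索是一门很有意思的课程", vocab)
print(segments)
```

Because the longest match wins, “信息检索” is kept as one word even though “信息” and “检索” are also in the vocabulary.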


Stop Words

•  Some extremely common words that would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely
   –  Determined by collection frequency: the total number of times each term appears in the document collection
•  Using a stop list significantly reduces the number of postings that a system has to store

[Figure: a stop list of 25 semantically nonselective words that are common in Reuters-RCV1]
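Collection frequency is easy to compute. The Python sketch below counts every occurrence of every term in a three-document toy collection and takes the most frequent terms as stop-word candidates. The documents and the cutoff of two terms are assumptions for illustration.

```python
from collections import Counter

docs = [
    "the history of the city",
    "the mayor of the city spoke",
    "a map of the river",
]

# Collection frequency: total occurrences of each term across all documents.
collection_freq = Counter(tok for doc in docs for tok in doc.lower().split())

# Take the top-k terms as stop-word candidates (k = 2 in this toy example).
stop_candidates = [term for term, _ in collection_freq.most_common(2)]
print(collection_freq["the"], collection_freq["of"], stop_candidates)
```

In practice such a frequency-ranked list is usually reviewed by hand, since a frequent word is not necessarily uninformative.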


Using Stop Words or Not?

•  Phrase searches
   –  “President of the United States”
   –  “flights from Vancouver” vs. “flights to Vancouver”
   –  “To be or not to be”, “Let It Be”, “I don’t want to be”, …
•  Web search engines generally do not use stop lists
   –  Some specific techniques introduced later can reduce the cost due to stop words
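The damage a stop list can do to such queries is easy to demonstrate. The Python sketch below assumes a small stop list chosen for this example.

```python
STOP_WORDS = {"to", "from", "of", "the", "be", "or", "not"}  # assumed stop list

def strip_stop_words(query):
    return [w for w in query.lower().split() if w not in STOP_WORDS]

from_query = strip_stop_words("flights from Vancouver")
to_query = strip_stop_words("flights to Vancouver")
hamlet = strip_stop_words("To be or not to be")
print(from_query, to_query, hamlet)
```

The two flight queries become identical, and the Hamlet query disappears entirely.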


Token Normalization

•  How can we know that “USA” matches “U.S.A.”?
•  Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens
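For this particular pair, one possible rule is to delete periods. The Python sketch below assumes exactly that single rule, and the function name `normalize` is illustrative.

```python
def normalize(token):
    # Drop periods so that "U.S.A." and "USA" fall into the same class.
    return token.replace(".", "")

print(normalize("U.S.A."), normalize("USA"))
```

A rule this blunt also merges tokens that should stay apart, for example an abbreviation written “C.A.T.” and the word “CAT”.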


Creating Equivalence Classes

•  Using rules that remove characters like hyphens
   –  Both “anti-discriminatory” and “antidiscriminatory” map to “antidiscriminatory”
•  Maintaining relations between unnormalized tokens
   –  Index unnormalized tokens and maintain a query expansion list of multiple vocabulary entries: when a query asks for “car”, search both “car” and “automobile”
   –  Alternatively, expand during index construction: index a document containing “car” under both “car” and “automobile”
   –  Expansion of query terms can be asymmetric
   –  Costs more space than equivalence classing, but is more flexible
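Query-time expansion can be sketched with a dictionary that maps each query term to the vocabulary entries to search. The tiny index, the document IDs, and the choice to expand “car” but not “automobile” are assumptions made to show an asymmetric case.

```python
# Toy inverted index over unnormalized tokens: term -> set of document IDs.
index = {"car": {1, 3}, "automobile": {2}}

# Asymmetric expansion list (hypothetical): "car" also searches "automobile",
# but "automobile" searches only itself.
expansions = {"car": ["car", "automobile"], "automobile": ["automobile"]}

def search(term):
    result = set()
    for entry in expansions.get(term, [term]):
        result |= index.get(entry, set())
    return result

print(search("car"), search("automobile"))
```

An equivalence class cannot express this one-directional relation, which is the flexibility the expansion list buys at the price of extra lookups or postings.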


Example


Normalization Techniques

•  Accents and diacritics
   –  Normalizing tokens to remove diacritics
•  Capitalization/case-folding
   –  Reducing all letters to lower case
•  Case-folding may cause problems for names such as Bush, Black, Fed, …
   –  Use some heuristics to make only some tokens lowercase, e.g., convert the first word in a sentence and all words in a title
   –  Truecasing: use a machine learning sequence model to make the decision
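Both basic techniques are available in the Python standard library. The sketch below removes diacritics by decomposing characters and dropping the combining marks, then shows plain case-folding. The sample words are illustrative.

```python
import unicodedata

def strip_diacritics(token):
    # NFD splits "é" into "e" plus a combining accent; the accent is then dropped.
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

plain = [strip_diacritics(w) for w in ["cliché", "naïve"]]
folded = "Bush".lower()
print(plain, folded)
```

After case-folding, the name “Bush” and the common noun “bush” can no longer be told apart, which is the problem the heuristics and truecasing try to limit.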


Stemming and Lemmatization

•  How can we know “organize”, “organizes”, and “organizing” should map to the same word?
•  Stemming and lemmatization: reduce inflectional forms and sometimes derivationally related forms of a word to a common base form
   –  am, are, is → be
   –  car, cars, car’s, cars’ → car
   –  “the boy’s cars are different colors” → “the boy car be different color”
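The effect of such a mapping can be reproduced with a plain lookup table. This Python sketch is only a toy: the table `BASE_FORMS` is hand-written for this one sentence and uses straight apostrophes. Real stemmers and lemmatizers derive the mapping from rules or dictionaries.

```python
# Hand-written mapping for illustration only.
BASE_FORMS = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
    "boy's": "boy", "colors": "color",
}

def reduce_forms(text):
    return " ".join(BASE_FORMS.get(w, w) for w in text.lower().split())

reduced = reduce_forms("the boy's cars are different colors")
print(reduced)
```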


Stemming

•  Algorithmic: a crude heuristic process that chops off the ends of words in the hope of being correct most of the time
   –  Often removes derivational affixes
•  Porter’s algorithm
   –  Use the rule “(m>1) EMENT → ” (delete the suffix when the measure m of the remaining stem exceeds 1) to map “replacement” to “replac”
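The single rule above can be sketched in isolation. In Porter’s definition a word has the form [C](VC)^m[V], where C and V are runs of consonants and vowels, and m is the measure. The Python code below implements only this measure and the EMENT rule. It is not the full Porter stemmer, and the function names are made up for this example.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # "y" counts as a vowel when it follows a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Number of VC sequences in the stem, i.e. m in [C](VC)^m[V]."""
    pattern = ""
    for i in range(len(stem)):
        kind = "C" if is_consonant(stem, i) else "V"
        if not pattern or pattern[-1] != kind:
            pattern += kind          # collapse runs of the same kind
    return pattern.count("VC")

def apply_ement_rule(word):
    """Apply only the rule (m>1) EMENT -> (empty)."""
    if word.endswith("ement"):
        stem = word[:-5]
        if measure(stem) > 1:
            return stem
    return word

print(apply_ement_rule("replacement"), apply_ement_rule("cement"))
```

For “replacement” the stem “replac” has the pattern CVCVC, so m = 2 and the suffix is removed. For “cement” the stem “c” has m = 0, so the word is left unchanged.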


Comparison of Stemmers


Lemmatization

•  Dictionary-based stemming
•  Use a vocabulary and morphological analysis of words to remove inflectional endings only and return the base or dictionary form of a word (lemma)
•  “saw” → “see” or “saw”, depending on whether the token is used as a verb or a noun
•  Can bring very modest benefit for retrieval in English: it improves recall but may hurt precision
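The dependence on part of speech can be sketched with a lookup keyed by both the token and its tag. The table and the tag names `VERB` and `NOUN` are assumptions for this Python example. A real lemmatizer would get the tag from a tagger and the lemma from a morphological dictionary.

```python
# Toy lemma table keyed by (token, part of speech); illustrative only.
LEMMA_TABLE = {("saw", "VERB"): "see", ("saw", "NOUN"): "saw"}

def lemmatize(token, pos):
    token = token.lower()
    return LEMMA_TABLE.get((token, pos), token)

print(lemmatize("saw", "VERB"), lemmatize("saw", "NOUN"))
```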


Krovetz Stemmer – A Hybrid Method

•  A hybrid approach
•  Constantly using a dictionary to check if a word is valid
   –  If a word is not found, check the word against a list of common inflectional and derivational suffixes, modify the word, and check the dictionary again
•  Using manually generated exception entries to record special stemming processing rules
•  Low false positive rate, but tends to have a high false negative rate
•  Producing stems that, in most cases, are full words
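The dictionary-checking loop can be sketched as follows. This is a heavily simplified illustration of the idea, not the Krovetz stemmer itself: the dictionary, the exception entry, the suffix list, and the “add back an e” repair are all assumptions made for this Python example.

```python
DICTIONARY = {"organize", "car", "walk", "pony"}   # toy dictionary
EXCEPTIONS = {"ponies": "pony"}                    # hand-made exception entry
SUFFIXES = ["ing", "es", "ed", "s"]                # tiny suffix list

def hybrid_stem(word):
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    if word in DICTIONARY:
        return word                                # already a valid word
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            base = word[:-len(suffix)]
            # Accept a modification only if the dictionary confirms it.
            for candidate in (base, base + "e"):
                if candidate in DICTIONARY:
                    return candidate
    return word                                    # leave unknown words alone

stems = [hybrid_stem(w) for w in ["organizing", "organizes", "cars", "ponies", "news"]]
print(stems)
```

Every returned stem is either a dictionary word or the untouched input. “news” is left alone because “new” is not in the toy dictionary, which shows the conservative behavior behind the low false positive rate.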


Comparison


Phrases

•  Phrases are important in queries
   –  For the query “black sea”, a document containing the sentence “the sea turned black” may not be a good match
•  Phrases are often subtle
   –  For the query “fishing supplies”, should documents containing “fish”, “fishing”, and “supplies” count?
•  How should phrases be identified in tokenizing and stemming?
   –  N-gram method: a phrase is any sequence of n words
   –  Many search engines index all n-grams of 2 ≤ n ≤ 5
   –  In a document of 1,000 words, there are 3,990 n-grams for 2 ≤ n ≤ 5
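The count of 3,990 follows from the fact that a document of N words contains N − n + 1 n-grams of length n: 999 + 998 + 997 + 996 = 3,990. The Python sketch below generates the n-grams and checks the total. The placeholder words `w0`, `w1`, … stand in for a real document.

```python
def ngrams(words, n_min=2, n_max=5):
    """All word n-grams with n_min <= n <= n_max, as tuples."""
    return [
        tuple(words[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(words) - n + 1)
    ]

document = [f"w{i}" for i in range(1000)]   # placeholder 1,000-word document
all_ngrams = ngrams(document)
print(len(all_ngrams))

print(ngrams("the sea turned black".split(), 2, 2))
```

In the second call the bigram (“black”, “sea”) does not appear, so a phrase index built this way would not match the query “black sea” against that sentence.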


Part-of-Speech (POS) Tagger

•  A POS tagger marks the words in a text with labels corresponding to the part-of-speech of the word in that context
   –  Based on statistical or rule-based approaches
   –  Trained on large, manually labeled corpora
•  Typical tags
   –  NN (singular noun), NNS (plural noun), VB (verb), VBD (verb, past tense), VBN (verb, past participle), IN (preposition), JJ (adjective), CC (conjunction, e.g., “and”, “or”), PRP (pronoun), MD (modal auxiliary, e.g., “can”, “will”)
•  Noun phrases: sequences of nouns or adjectives followed by nouns
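Given tagged text, the noun-phrase rule can be sketched as a single pass over the tags. The Python code below assumes the input is already tagged. The sample sentence and its tags are hand-assigned for illustration, and the function name `noun_phrases` is hypothetical. As a simplification, any run of JJ and NN* tags that ends in a noun is accepted.

```python
def noun_phrases(tagged):
    """Collect runs of adjectives (JJ) and nouns (NN*) that end in a noun."""
    phrases, current = [], []
    for word, tag in tagged + [("", "END")]:   # sentinel flushes the last run
        if tag.startswith("NN") or tag == "JJ":
            current.append((word, tag))
            continue
        # Run ended: drop trailing adjectives, keep the rest if anything is left.
        while current and not current[-1][1].startswith("NN"):
            current.pop()
        if current:
            phrases.append(" ".join(w for w, _ in current))
        current = []
    return phrases

tagged = [("the", "DT"), ("quick", "JJ"), ("search", "NN"), ("engine", "NN"),
          ("indexes", "VBZ"), ("large", "JJ"), ("collections", "NNS")]
phrases = noun_phrases(tagged)
print(phrases)
```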


POS Tagger Example

Classroom discussion: What is the major drawback of a POS tagger in web search?


Summary

•  It is an important task to extract tokens from documents
•  Choosing document units
•  Tokenization
•  Stop words and using stop words
•  Token normalization
•  Stemming and lemmatization
•  Processing phrases


To-do List

•  Read Section 4.3 in the textbook
•  Try out the Porter Stemmer