

March 2006 Introduction to Computational Linguistics


CLINT

Tokenisation


Information Food Chain

Inference
↑ Knowledge Representation
↑ Meaning Extraction
↑ Semantic Relationships
↑ Chunking (noun phrases; verb phrases)
↑ Part of Speech Annotation
↑ Paragraph and sentence identification
↑ Tokenisation
↑ Raw Text


Start with a Corpus

• A corpus is an organised body of materials from language that is used as a basis for empirical studies.

• Corpora are classified according to
– Representativeness
– Medium
– Language
– Information Content
– Structure


Examples of Corpora

• Project Gutenberg: public domain text resources. http://www.promo.net/pg

• Brown Corpus: a tagged corpus of about 1M words put together at Brown University, 1960-70

• Penn Treebank: a corpus of parsed sentences based on text from the Wall Street Journal (WSJ)

• Canadian Hansards: bilingual (English-French) corpus of the proceedings of the Canadian parliament.


Low Level Issues

• Preprocessing: getting rid of junk such as whitespace, images, certain formatting information etc.

• Normalisation: deciding on standard character representations; adopting upper or lower case (or both)

• Tokenisation


Tokenisation

• Tokenisation is a process which divides input text into individual units called tokens.

• Tokens are normally taken to be indivisible by the next level of analysis, but they can be associated with various kinds of information.

• An example of such information is the type of the token: word, punctuation, number
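
A minimal sketch of tokens that carry a type, assuming only the three types named above (word, number, punctuation). The names Token and tokenise are illustrative and do not come from the course material; characters that fit none of the three patterns (such as the underscore) are silently skipped.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    text: str
    type: str  # "word", "number" or "punctuation"


# One named group per token type; the name of the group that matched
# becomes the token's type.
_TOKEN_RE = re.compile(
    r"(?P<number>\d+)"
    r"|(?P<word>[^\W\d_]+)"          # letters only
    r"|(?P<punctuation>[^\w\s])"     # any single non-space symbol
)


def tokenise(text: str) -> List[Token]:
    """Divide text into typed tokens (illustrative, not exhaustive)."""
    return [Token(m.group(), m.lastgroup) for m in _TOKEN_RE.finditer(text)]


print(tokenise("Rooms 12, please."))
```

The next level of analysis can then treat each Token as indivisible while still consulting its type.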


What counts as a word?

• Words are quite tricky to define

• The standard definition: a string of contiguous alphanumeric characters with space on either side; may include hyphens and apostrophes but no other punctuation marks (Kucera and Francis 1967)

• It is easy to find exceptions.
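
The standard definition can be written down almost literally as a regular expression. The sketch below assumes ASCII letters and digits only, and the function name kf_words is made up for illustration. It keeps a whitespace-delimited chunk only if the chunk consists of alphanumerics, hyphens and apostrophes.

```python
import re

# Alphanumerics plus hyphen and apostrophe, with at least one alphanumeric.
_KF_WORD = re.compile(r"[A-Za-z0-9'-]*[A-Za-z0-9][A-Za-z0-9'-]*")


def kf_words(text):
    """Return the whitespace-delimited chunks that satisfy the definition."""
    return [chunk for chunk in text.split() if _KF_WORD.fullmatch(chunk)]


print(kf_words("won't pay $22.50 for hi-fi today."))
```

On this input the chunks "$22.50" and "today." are rejected, which shows how quickly exceptions turn up.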


Problems Identifying Words

VfB Stuttgart scored twice in quick success-ion early in the second half on their way to a deserved 2-1 victory over Manchester United in the Champions League on Wednesday.
(example from Mary Dalrymple, University of London)

• VfB Stuttgart, Manchester United
• succession
• 2-1
• Wednesday


Problems Identifying Words: Problems Involving Spaces

• Lack of spaces between words
– Lebensversicherungsgesellschaftsangestellter (life insurance company employee)
– Ix-Xemx (Maltese: the sun)

• The presence of spaces may not indicate a word break
– Coca Cola; +356 21 456 457


Problems Involving Special Characters

• Words often include non-alphanumeric characters which are actually part of the word.
– $22.50; www.di-ve.com.mt; BSc. IT; :-)

• Words are often terminated by punctuation which is not part of the word.

• Sometimes, terminating punctuation is part of the word.


Periods

• In general, punctuation marks attach to words, and can be removed. However, there are special cases:

• Most periods mark the end of a sentence

• Others mark abbreviations, e.g. "e.g.", "Wash."

• Note that when an abbreviation occurs at the end of a sentence there is only one period.
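
One common way to handle the special cases is a list of known abbreviations. The sketch below assumes a tiny hand-made list (ABBREVIATIONS is hypothetical, not a real resource) and a made-up function name. It detaches a final period unless the whole chunk is a listed abbreviation.

```python
# Hypothetical abbreviation list, stored in lower case.
ABBREVIATIONS = {"e.g.", "wash.", "etc."}


def split_final_period(chunk):
    """Detach a trailing period unless the chunk is a known abbreviation."""
    if len(chunk) > 1 and chunk.endswith(".") and chunk.lower() not in ABBREVIATIONS:
        return [chunk[:-1], "."]
    return [chunk]


for chunk in ["half.", "Wash.", "e.g.", "cat"]:
    print(chunk, "->", split_final_period(chunk))
```

This sketch does not address the last point above: when "Wash." ends a sentence, the single period is both part of the abbreviation and the sentence terminator, and deciding that needs sentence-boundary information.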


Apostrophe

• English contractions such as won't or I'll count as one word according to the classic definition

• However there are reasons for wanting two separate tokens – such as interaction with grammar rules (S → NP VP)

• Penn Treebank splits such contractions into two words.


Apostrophe

• This sometimes leaves odd words
– For example isn't yields is + n't

• 's is ambiguous
– Abbreviation for is (he's strange)
– Possessive (John's car)

• Word-final apostrophe is ambiguous
– end of quotation
– possessive of word ending in s
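
A simplified sketch of splitting contractions into two tokens, in the spirit of the Penn Treebank convention mentioned above. The clitic list and the function name are assumptions made for illustration, not the Treebank's actual tokeniser, and only the straight apostrophe (') is handled.

```python
import re

# Assumed list of English clitics; anchored at the end of the word.
_CLITIC = re.compile(r"^(.+?)(n't|'s|'ll|'re|'ve|'d|'m)$", re.IGNORECASE)


def split_contraction(word):
    """Split a word into [host, clitic] if it ends in a known clitic."""
    m = _CLITIC.match(word)
    return [m.group(1), m.group(2)] if m else [word]


for w in ["isn't", "won't", "he's", "John's", "I'll", "cat"]:
    print(w, "->", split_contraction(w))
```

The odd word from won't is wo, and the sketch returns the same 's token for he's and John's, so the ambiguity between is and the possessive is left for a later stage.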


Exercise

• How is the apostrophe used in Maltese?

• How should a Maltese tokeniser deal with it?


Hyphen

• Issue: do sequences of words joined by hyphens count as one word or more?

• Typesetting hyphens (at end of line) and hyphens in measure phrases (35-year-old) are usually removed.

• Typesetting hyphens can be ambiguous

• Lexical hyphens are usually kept: hi-fi

• Hyphens – standing alone – are used as punctuation.

• Texts are often inconsistent in usage of hyphens
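
A sketch of one way to treat end-of-line hyphens, assuming a word list is available (the lexicon argument is hypothetical). The two halves are joined when the joined form is a known word; otherwise the hyphen is kept as a lexical hyphen.

```python
import re

# A hyphen at the end of a line, followed by the rest of the word.
_LINE_BREAK_HYPHEN = re.compile(r"(\w+)-[ \t]*\n[ \t]*(\w+)")


def join_line_break_hyphens(text, lexicon):
    """Undo typesetting hyphens; keep the hyphen if the joined form is unknown."""
    def repair(m):
        joined = m.group(1) + m.group(2)
        if joined.lower() in lexicon:
            return joined
        return m.group(1) + "-" + m.group(2)

    return _LINE_BREAK_HYPHEN.sub(repair, text)


lexicon = {"succession"}
print(join_line_break_hyphens("in quick success-\nion early", lexicon))
print(join_line_break_hyphens("a new hi-\nfi system", lexicon))
```

The ambiguity noted above remains: a word list cannot tell whether a line-final hyphen in a form such as re-cover was lexical or typographic when both spellings are words.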


Case

• Types vs. Tokens
– How many tokens are there in the following sentence: The cat chased the rat on the table
– How many types?

• Tokenisation should correctly identify word types, i.e.
– Tokens of the same type should be identified
– Tokens of different type should be distinguished

• Case representation of ordinary words must be standardised.
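
The effect of case on the type count can be checked directly. This sketch assumes whitespace tokenisation, which is enough for the example sentence; the function name is illustrative.

```python
def count_tokens_and_types(sentence, fold_case=False):
    """Return (number of tokens, number of distinct types)."""
    tokens = sentence.split()
    if fold_case:
        tokens = [t.lower() for t in tokens]
    return len(tokens), len(set(tokens))


sentence = "The cat chased the rat on the table"
print(count_tokens_and_types(sentence))                  # (8, 7)
print(count_tokens_and_types(sentence, fold_case=True))  # (8, 6)
```

Without case standardisation "The" and "the" count as different types, so the sentence has 7 types; after lower-casing it has 6.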


Case

• Heuristics
– Map first character of a sentence to standard case
– Map all words in titles to lowercase

• Problems
– Identification of sentence boundaries
– Identification of proper names


Normalisation

• Character representations.

• Converting all letters to lower or upper case

• Removing punctuation

• Removing accent marks and other diacritics from letters

• Expanding abbreviations
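
A sketch of three of these steps (case folding, diacritic removal, punctuation removal) using Python's standard unicodedata module. The function names are illustrative, and expanding abbreviations is left out because it needs a lookup table.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalise(text):
    """Lower-case, strip diacritics, and remove punctuation characters."""
    text = strip_diacritics(text.lower())
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


print(normalise("Schütze, MIT!"))
```

Each step discards information (Schütze and Schutze become the same string), so whether to apply it depends on the application.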


Further Normalisation

• Stemming: are eats and eating different words?

• They are two different wordforms that have the same stem, eat, but different suffixes, -s and -ing

• Stemming versus full morphological analysis.
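
A deliberately naive suffix-stripping sketch, assuming only the two suffixes from the example. It is not a full stemming algorithm and performs no morphological analysis; the function name and the minimum-stem-length check are illustrative choices.

```python
def naive_stem(word, suffixes=("ing", "s")):
    """Strip the first matching suffix, keeping a stem of at least two letters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word


for w in ["eats", "eating", "eat", "sing"]:
    print(w, "->", naive_stem(w))
```

Here eats and eating both map to eat. A stemmer of this kind only chops strings, whereas full morphological analysis would also identify the suffix and what it contributes.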


Summary

• The tokenisation problem interacts with design decisions at different levels concerning
– Handling of non-alphanumeric characters
– Case
– Punctuation

• Typically many of these problems are dealt with by hand-crafting special rules, each of which matches a particular case.

• Such rules are often built out of regular expressions.
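
A sketch of hand-crafted rules built from regular expressions. The rule names and patterns are assumptions chosen to cover a few examples from earlier slides ($22.50, www.di-ve.com.mt, 2-1). Rules are tried in order, so the special cases come before the general word rule.

```python
import re

# Ordered (name, pattern) rules: most specific first.
RULES = [
    ("money", r"\$\d+(?:\.\d+)?"),
    ("url", r"www\.[\w-]+(?:\.[\w-]+)+"),
    ("score", r"\d+-\d+"),
    ("word", r"\w+(?:[-']\w+)*"),
    ("punctuation", r"[^\w\s]"),
]

_MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in RULES))


def tokenise_with_rules(text):
    """Return (token, rule name) pairs using the first rule that matches."""
    return [(m.group(), m.lastgroup) for m in _MASTER.finditer(text)]


for token in tokenise_with_rules("Pay $22.50 at www.di-ve.com.mt, a 2-1 win."):
    print(token)
```

Reordering the list changes the result: with the word rule first, 2-1 would be labelled a word and $22.50 would be broken up.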


Sources

Foundations of Statistical Natural Language Processing, Manning and Schütze, MIT Press 1999