LING 388: Language and Computers Sandiway Fong Lecture 23: 11/15

Post on 20-Dec-2015

LING 388: Language and Computers

Sandiway Fong

Lecture 23: 11/15

Part-of-Speech (POS) Tagging

• Basic Idea:
  – assign the right part-of-speech tag, e.g. noun, verb, conjunction, to a word
  – useful for shallow parsing
  – or as first stage of a deeper/more sophisticated system
• Question:
  – Is it a hard task?
    • i.e. can’t we just look the words up in a dictionary?
• Answer:
  – Yes.
    • Ambiguity.
  – No.
    • POS tagging programs typically claim 95%+ accuracy

POS Tagging

• Task:
  – assign the right part-of-speech tag to a word in context
  – not always easy
• Example: walk
  – the walk : noun (I took …)
  – I walk : verb (… 2 miles every day)
• Example: still: noun, adjective, adverb, verb
  – the still of the night, a glass still
  – still waters
  – stand still
  – still struggling
  – Still, I didn’t give way
  – still your fear of the dark (transitive)
  – the bubbling waters stilled (intransitive)
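
The walk and still examples show why plain dictionary lookup is not enough: the lookup returns several candidate tags and cannot choose among them. Below is a minimal sketch of that situation. The lexicon is a hand-written toy for illustration (the entries, the tag names, and the helper functions `lookup` and `ambiguous_words` are assumptions, not part of the lecture).

```python
# Toy lexicon: word -> set of possible POS tags.
# Hand-written for illustration; not taken from a real dictionary or corpus.
LEXICON = {
    "the": {"determiner"},
    "i": {"pronoun"},
    "walk": {"noun", "verb"},
    "still": {"noun", "adjective", "adverb", "verb"},
}


def lookup(word):
    """Return every tag the toy lexicon lists for a word (empty set if unknown)."""
    return LEXICON.get(word.lower(), set())


def ambiguous_words(sentence):
    """Return the words of a whitespace-separated sentence with more than one candidate tag."""
    return [w for w in sentence.split() if len(lookup(w)) > 1]
```

For "the walk" the lookup leaves walk with two candidates (noun, verb); only the context (the preceding determiner) settles the choice.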

POS Tagging

• Issues/Questions:
  – What are the parts of speech and subclasses that we might want to tag?
  – What does a typical tagset look like?
  – What methods can we use to assign tags?

Parts-of-Speech

• Divide words into classes based on grammatical function
  – nouns (open-class: unlimited set)
    • referential items (denoting objects/concepts etc.)
      – proper nouns: John
      – pronouns: he, him, she, her, it
      – anaphors: himself, herself (reflexives)
      – common nouns: dog, dogs, water
        » number: dog (singular), dogs (plural)
        » count-mass distinction: many dogs, *many waters
      – eventive nouns: dismissal, concert, playback, destruction (deverbal)
    • nonreferential items
      – it as in it is important to study
      – there as in there seems to be a problem
      – some languages don’t have these: e.g. Japanese
    • open-class
      – factoid, email, bush-ism

Parts-of-Speech

• Pronouns: it, I, he, you, his, they, this, that, she, her, we, all, which, their, what

Parts-of-Speech

• Divide words into classes based on grammatical function
  – verbs (closed-class: fixed set)
    • auxiliaries
      – be (passive, progressive)
      – have (pluperfect tense)
      – do (what did John buy?, Did Mary win?)
      – modals: can, could, would, will, may
    • Irregular:
      – is, was, were, does, did

Parts-of-Speech

• Divide words into classes based on grammatical function
  – verbs (open-class: unlimited set)
    • Intransitive
      – unaccusatives: arrive (achievement)
      – unergatives: run, jog (activities)
    • Transitive
      – actions: hit (semelfactive: hit the ball for an hour)
      – actions: eat, destroy (accomplishment)
      – psych verbs: frighten (x frightens y), fear (y fears x)
    • Ditransitive
      – put (x put y on z, *x put y)
      – give (x gave y z, *x gave y, x gave z to y)
      – load (x loaded y (on z), x loaded z (with y))
  – Open-class:
    • reaganize, email, fax

Parts-of-Speech

• Divide words into classes based on grammatical function
  – adjectives (open-class: unlimited set)
    • modify nouns
    • black, white, open, closed, sick, well
    • attributive: black (black car, car is black), main (main street, *street is main), atomic
    • predicative: afraid (*afraid child, the child is afraid)
    • stage-level: drunk (there is a man drunk in the pub)
    • individual-level: clever, short, tall (*there is a man tall in the bar)
    • object-taking: proud (proud of him, *well of him)
    • intersective: red (red car: intersection of the set of red things and the set of cars)
    • non-intersective: former (former architect), atomic (atomic scientist)
    • comparative, superlative: blacker, blackest, *opener, *openest
  – open-class:
    • hackable, spammable

Parts-of-Speech

• Divide words into classes based on grammatical function
  – adverbs (open-class: unlimited set)
    • modify verbs (adjectives and other adverbs)
    • manner: slowly (moved slowly)
    • degree: slightly, more (more clearly), very (very bad), almost
    • sentential: unfortunately, suddenly
    • question: how
    • temporal: when, soon, yesterday (noun?)
    • location: sideways, here (John is here)
  – open-class:
    • spam-wise

Parts-of-Speech

• Divide words into classes based on grammatical function
  – prepositions (closed-class: fixed set)
  – come before an object, assign a semantic function (from Mars, *Mars from)
    • head-final languages: postpositions (Japanese: amerika-kara)
  – location: on, in, by
  – temporal: by, until

POS Tagging

• Task:
  – assign the right part-of-speech tag, e.g. noun, verb, conjunction, to a word in context
• POS taggers
  – need to be fast in order to process large corpora
    • should take no more than time linear in the size of the corpora
  – full parsing is slow
    • e.g. context-free grammar: n³ time, where n is the length of the sentence
  – POS taggers try to assign the correct tag without actually parsing the sentence

POS Tagging

• Components:
  – Dictionary of words
    • Exhaustive list of closed-class items
      – Examples:
        » the, a, an: determiner
        » from, to, of, by: preposition
        » and, or: coordinating conjunction
    • Large set of open-class (e.g. nouns, verbs, adjectives) items with frequency information
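
One way to picture this dictionary component is as two tables: a small exhaustive table for closed-class words and a larger table of open-class words with per-tag counts. The sketch below is only an illustration; the frequency counts are invented, and the names `CLOSED_CLASS`, `OPEN_CLASS`, and `candidate_tags` are hypothetical.

```python
from collections import Counter

# Closed-class items: exhaustive, one tag each (examples from the slide).
CLOSED_CLASS = {
    "the": "determiner", "a": "determiner", "an": "determiner",
    "from": "preposition", "to": "preposition",
    "of": "preposition", "by": "preposition",
    "and": "coordinating conjunction", "or": "coordinating conjunction",
}

# Open-class items: word -> tag frequencies. The counts are invented.
OPEN_CLASS = {
    "walk": Counter({"verb": 7, "noun": 3}),
    "dog": Counter({"noun": 10}),
}


def candidate_tags(word):
    """Return (tag, count) pairs for a word, most frequent first.

    Closed-class words carry no count (None); unknown words give an empty list.
    """
    word = word.lower()
    if word in CLOSED_CLASS:
        return [(CLOSED_CLASS[word], None)]
    counts = OPEN_CLASS.get(word)
    return counts.most_common() if counts else []
```

An empty result marks an extra-dictionary word, which a tagger has to handle by a separate mechanism.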

POS Tagging

• Components:
  – Mechanism to assign tags
    • Context-free: by frequency
    • Context: bigram, trigram, HMM, hand-coded rules
      – Example:
        » Det Noun/*Verb: the walk…
  – Mechanism to handle unknown words (extra-dictionary)
    • Capitalization
    • Morphology: -ed, -tion
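
The two mechanisms above can be sketched as small heuristics. This is a rough illustration under stated assumptions, not the method of any particular tagger: the mapping of -ed to verb and -tion to noun, the noun fallback, and the function names `guess_unknown` and `prefer_after_determiner` are all choices made for this example.

```python
def guess_unknown(word, sentence_initial=False):
    """Guess a tag for a word that is missing from the dictionary."""
    # Capitalization: a capitalized word that does not start the sentence
    # is guessed to be a proper noun.
    if word[:1].isupper() and not sentence_initial:
        return "proper noun"
    # Morphology: suffixes hint at the class (assumed mapping).
    if word.endswith("tion"):
        return "noun"
    if word.endswith("ed"):
        return "verb"
    # Fallback assumption for this sketch: treat remaining unknowns as nouns.
    return "noun"


def prefer_after_determiner(prev_tag, candidates):
    """Context pattern Det Noun/*Verb: after a determiner, pick noun over verb.

    Returns None when the pattern does not apply.
    """
    if prev_tag == "determiner" and "noun" in candidates:
        return "noun"
    return None
```

For "the walk", `prefer_after_determiner("determiner", {"noun", "verb"})` selects noun, which is the reading the slide marks as correct.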

How Hard is Tagging?

• Brown Corpus (Francis & Kucera, 1982):
  – 1 million words
  – 39K distinct words
  – 35K words with only 1 tag
  – 4K with multiple tags (DeRose, 1988)
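
As a quick check on what these counts imply, the proportions can be worked out from the rounded figures on the slide. Note that these are shares of distinct words (types), not of the 1 million running words; the slide gives no token-level breakdown.

```python
# Rounded figures from the slide (distinct words in the Brown Corpus).
distinct_words = 39_000
one_tag = 35_000
multiple_tags = 4_000

share_unambiguous = one_tag / distinct_words      # about 0.897
share_ambiguous = multiple_tags / distinct_words  # about 0.103

print(f"unambiguous types: {share_unambiguous:.1%}")
print(f"ambiguous types:   {share_ambiguous:.1%}")
```

So roughly nine in ten distinct words have a single tag, and about one in ten needs disambiguation.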

How Hard is Tagging?

• Easy task to do well on:
  – naïve algorithm
    • assign tag by frequency
  – 90% accuracy (Charniak et al., 1993)
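
The naïve algorithm can be sketched in a few lines: count how often each word carries each tag in a tagged corpus, then always assign the most frequent one. The training data below is a three-sentence toy invented for illustration, so its accuracy figure says nothing about the 90% reported on real corpora; the function names and the default tag for unseen words are also assumptions.

```python
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag(model, words, default="NN"):
    """Tag each word with its most frequent tag; unseen words get the default."""
    return [(w, model.get(w.lower(), default)) for w in words]


def accuracy(model, tagged_sentences):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    total = correct = 0
    for sentence in tagged_sentences:
        words = [w for w, _ in sentence]
        for (_, gold), (_, predicted) in zip(sentence, tag(model, words)):
            total += 1
            correct += gold == predicted
    return correct / total if total else 0.0


# Toy corpus (invented): "walk" is a verb twice and a noun once.
TRAIN = [
    [("the", "DT"), ("walk", "NN")],
    [("I", "PRP"), ("walk", "VBP")],
    [("they", "PRP"), ("walk", "VBP")],
]
model = train(TRAIN)
```

On this toy data the model always tags walk as VBP, so "the walk" comes out wrong: frequency alone ignores context, which is the gap the context-based methods aim to close.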

Penn TreeBank Tagset

• 48-tag simplification of Brown Corpus tagset
• Examples:
  1. CC Coordinating conjunction
  3. DT Determiner
  7. JJ Adjective
  11. MD Modal
  12. NN Noun (singular, mass)
  13. NNS Noun (plural)
  27. VB Verb (base form)
  28. VBD Verb (past)

Penn TreeBank Tagset

www.ldc.upenn.edu/doc/treebank2/cl93.html

1  CC    Coordinating conjunction
2  CD    Cardinal number
3  DT    Determiner
4  EX    Existential there
5  FW    Foreign word
6  IN    Preposition/subord. conjunction
7  JJ    Adjective
8  JJR   Adjective, comparative
9  JJS   Adjective, superlative
10 LS    List item marker
11 MD    Modal
12 NN    Noun, singular or mass
13 NNS   Noun, plural
14 NNP   Proper noun, singular
15 NNPS  Proper noun, plural
16 PDT   Predeterminer
17 POS   Possessive ending
18 PRP   Personal pronoun
19 PRP$  Possessive pronoun
20 RB    Adverb
21 RBR   Adverb, comparative
22 RBS   Adverb, superlative
23 RP    Particle
24 SYM   Symbol (mathematical or scientific)

Penn TreeBank Tagset

www.ldc.upenn.edu/doc/treebank2/cl93.html

25 TO    to
26 UH    Interjection
27 VB    Verb, base form
28 VBD   Verb, past tense
29 VBG   Verb, gerund/present participle
30 VBN   Verb, past participle
31 VBP   Verb, non-3rd ps. sing. present
32 VBZ   Verb, 3rd ps. sing. present
33 WDT   wh-determiner
34 WP    wh-pronoun
35 WP$   Possessive wh-pronoun
36 WRB   wh-adverb
37 #     Pound sign
38 $     Dollar sign
39 .     Sentence-final punctuation
40 ,     Comma
41 :     Colon, semi-colon
42 (     Left bracket character
43 )     Right bracket character
44 "     Straight double quote
45 `     Left open single quote
46 ``    Left open double quote
47 '     Right close single quote
48 ''    Right close double quote

Penn TreeBank Tagset

• How many tags?
  – Tag criterion
    • Distinctness with respect to grammatical behavior?
  – Make tagging easier?
• Punctuation tags
  – Penn Treebank numbers 37-48
    • Trivial computational task

Penn TreeBank Tagset

• Simplifications:
  – Tag TO:
    • infinitival marker, preposition
    • I want to win
    • I went to the store
  – Tag IN:
    • preposition: that, when, although
    • I know that I should have stopped, although…
    • I stopped when I saw Bill

Penn TreeBank Tagset

• Simplifications:
  – Tag DT:
    • determiner: any, some, these, those
    • any man
    • these *man/men
  – Tag VBP:
    • verb, present: am, are, walk
    • Am I here?
    • *Walked I here?/Did I walk here?

Hard to Tag Items

• Syntactic Function
  – Example:
    • resultative
    • I saw the man tired from running
• Examples (from Brown Corpus Manual)
  – Hyphenation:
    • long-range, high-energy
    • shirt-sleeved
    • signal-to-noise
  – Foreign words:
    • mens sana in corpore sano

Rule-Based POS Tagging

• Example Systems
  – ENGCG (1,100 rules)
    • http://www.lingsoft.fi/cgi-bin/engcg
  – ENGCG-2 (4,000 rules)
    • http://www.connexor.com/demos/tagger_en.html
• Core Components
  – English morphological analyzer based on two-level morphology
    • see last lecture
  – 56K word stems
  – processing
    • apply morphological engine
    • get all possible tags for each word
    • apply rules

Rule-Based POS Tagging

• Example:
  – Pavlov had shown that salivation can be a conditioned reflex

Rule-Based POS Tagging

• Examples of tags:
  – PCP2: past participle
  – SV: subject verb
  – SVOO: subject verb object object

Rule-Based POS Tagging

• Example:
  – it isn’t that:adv odd
• Rule:
  – given input “that”
  – if
    • (+1 A/ADV/QUANT)
    • (+2 SENT-LIM)
    • (NOT -1 SVOC/A)
  – then eliminate non-ADV tags
  – else eliminate ADV tag
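
The rule above reads as a test on the neighbors of "that". The sketch below is one possible rendering, not ENGCG's actual implementation or rule syntax. It assumes each word already carries its full set of candidate tags; the tags A, ADV, QUANT, SENT-LIM and SVOC/A come from the slide, while PRON, DET, CS, V and both function names are hypothetical. Positions outside the sentence are treated as a sentence limit.

```python
def tags_at(tokens, i):
    """Candidate tags at position i; positions outside the sentence count as SENT-LIM."""
    if i < 0 or i >= len(tokens):
        return {"SENT-LIM"}
    return tokens[i][1]


def disambiguate_that(tokens, i):
    """Return the tags that survive the rule for the word at position i.

    tokens is a list of (word, set_of_candidate_tags) pairs.
    """
    word, tags = tokens[i]
    if word.lower() != "that":
        return set(tags)
    next_is_modifiable = bool(tags_at(tokens, i + 1) & {"A", "ADV", "QUANT"})
    then_sentence_limit = "SENT-LIM" in tags_at(tokens, i + 2)
    prev_not_svoc = "SVOC/A" not in tags_at(tokens, i - 1)
    if next_is_modifiable and then_sentence_limit and prev_not_svoc:
        survivors = tags & {"ADV"}   # eliminate non-ADV tags
    else:
        survivors = tags - {"ADV"}   # eliminate the ADV tag
    # Design choice in this sketch: never remove the last remaining tag.
    return survivors or set(tags)


THAT_TAGS = {"ADV", "PRON", "DET", "CS"}
isnt_that_odd = [("it", {"PRON"}), ("isn't", {"V"}),
                 ("that", THAT_TAGS), ("odd", {"A"})]
consider_that_odd = [("I", {"PRON"}), ("consider", {"V", "SVOC/A"}),
                     ("that", THAT_TAGS), ("odd", {"A"})]
```

In "it isn't that odd" all three conditions hold, so only ADV survives. In "I consider that odd" the preceding verb carries SVOC/A, so the rule removes ADV instead.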

Rule-Based POS Tagging

• Now ENGCG-2 (4,000 rules)
  – http://www.connexor.com/demos/tagger_en.html

Rule-Based POS Tagging

• Best performance of all systems: 99.7%

Next Time

• Look at statistical techniques …