Word classes and part of speech tagging Reading: Chap 5, Jurafsky & Martin Instructor: Paul Tarau, based on Rada Mihalcea’s original slides Note: Some of the material in this slide set was adapted from Chris Brew’s (OSU) slides on part of speech tagging




Slide 2

Outline

• Why part of speech tagging?
• Word classes
• Tag sets and problem definition
• Automatic approaches 1: rule-based tagging
• Automatic approaches 2: stochastic tagging
• Automatic approaches 3: transformation-based tagging
• Other issues: tagging unknown words, evaluation


Slide 3

Definition

“The process of assigning a part-of-speech or other lexical class marker to each word in a corpus” (Jurafsky and Martin)

WORDS: the girl kissed the boy on the cheek
TAGS: N, V, P, DET


Slide 4

An Example

WORD:  the girl kissed the boy on the cheek
LEMMA: the girl kiss the boy on the cheek
TAG:   +DET +NOUN +VPAST +DET +NOUN +PREP +DET +NOUN


Slide 5

Motivation

• Speech synthesis: pronunciation
• Speech recognition: class-based N-grams
• Information retrieval: stemming, selection of high-content words
• Word-sense disambiguation
• Corpus analysis of language & lexicography


Slide 6

Outline

• Why part of speech tagging?
• Word classes
• Tag sets and problem definition
• Automatic approaches 1: rule-based tagging
• Automatic approaches 2: stochastic tagging
• Automatic approaches 3: transformation-based tagging
• Other issues: tagging unknown words, evaluation


Slide 7

Word Classes

Basic word classes: Noun, Verb, Adjective, Adverb, Preposition, …

Open vs. closed classes

Open: Nouns, Verbs, Adjectives, Adverbs. Why “open”?

Closed:
– determiners: a, an, the
– pronouns: she, he, I
– prepositions: on, under, over, near, by, …


Slide 8

Open Class Words

Every known human language has nouns and verbs

Nouns: people, places, things
Classes of nouns:
– proper vs. common
– count vs. mass

Verbs: actions and processes
Adjectives: properties, qualities
Adverbs: a hodgepodge!

Unfortunately, John walked home extremely slowly yesterday

Numerals: one, two, three, third, …


Slide 9

Closed Class Words

Differ more from language to language than open class words

Examples:
– prepositions: on, under, over, …
– particles: up, down, on, off, …
– determiners: a, an, the, …
– pronouns: she, who, I, …
– conjunctions: and, but, or, …
– auxiliary verbs: can, may, should, …


Slide 10

Prepositions from CELEX


Slide 11

English Single-Word Particles


Slide 12

Pronouns in CELEX


Slide 13

Conjunctions


Slide 14

Auxiliaries


Slide 15

Outline

• Why part of speech tagging?
• Word classes
• Tag sets and problem definition
• Automatic approaches 1: rule-based tagging
• Automatic approaches 2: stochastic tagging
• Automatic approaches 3: transformation-based tagging
• Other issues: tagging unknown words, evaluation


Slide 16

Word Classes: Tag Sets

• Tag sets vary in number of tags: from about a dozen to over 200
• The size of a tag set depends on the language, objectives, and purpose
– Some tagging approaches (e.g., constraint-grammar based) make fewer distinctions, e.g., conflating prepositions, conjunctions, and particles
– Simple morphology = more ambiguity = fewer tags


Slide 17

Word Classes: Tag set example

PRP (personal pronoun), PRP$ (possessive pronoun)


Slide 18

Example of Penn Treebank Tagging of Brown Corpus Sentence

The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

Book/VB that/DT flight/NN ./.

Does/VBZ that/DT flight/NN serve/VB dinner/NN ?/.

See http://www.infogistics.com/posdemo.htm

Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo


Slide 19

The Problem

Words often have more than one word class, e.g., “this”:
– This is a nice day → PRP
– This day is nice → DT
– You can go this far → RB


Slide 20

Word Class Ambiguity (in the Brown Corpus)

Unambiguous (1 tag): 35,340
Ambiguous (2-7 tags): 4,100
– 2 tags: 3,760
– 3 tags: 264
– 4 tags: 61
– 5 tags: 12
– 6 tags: 2
– 7 tags: 1

(DeRose, 1988)
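A quick arithmetic check on these word-type counts shows how large the ambiguous share of the vocabulary is:

```python
# DeRose's Brown-corpus word-type counts (from the slide above)
unambiguous = 35340   # word types with exactly 1 possible tag
ambiguous = 4100      # word types with 2-7 possible tags
total = unambiguous + ambiguous

share = ambiguous / total
print(f"{share:.1%} of word types are ambiguous")  # 10.4%
```

Note that these are type counts; ambiguous types tend to be frequent words, so the fraction of ambiguous running tokens in text is considerably higher.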


Slide 21

Part-of-Speech Tagging

• Rule-Based Tagger: ENGTWOL (ENGlish TWO Level analysis)

• Stochastic Tagger: HMM-based
• Transformation-Based Tagger (Brill)


Slide 22

Outline

• Why part of speech tagging?
• Word classes
• Tag sets and problem definition
• Automatic approaches 1: rule-based tagging
• Automatic approaches 2: stochastic tagging
• Automatic approaches 3: transformation-based tagging
• Other issues: tagging unknown words, evaluation


Slide 23

Rule-Based Tagging

• Basic idea:
– Assign all possible tags to each word
– Remove tags according to rules of the type: if word+1 is an adjective, adverb, or quantifier, and the following token is a sentence boundary, and word-1 is not a verb like “consider”, then eliminate non-adverb tags, else eliminate the adverb tag.
– Typically more than 1,000 hand-written rules are used, but they may also be machine-learned.


Slide 24

Sample ENGTWOL Lexicon
Demo: http://www2.lingsoft.fi/cgi-bin/engtwol


Slide 25

Stage 1 of ENGTWOL Tagging

First stage: run words through a Kimmo-style two-level morphological analyzer to get all possible parts of speech.

Example: “Pavlov had shown that salivation …”

Pavlov      PAVLOV N NOM SG PROPER
had         HAVE V PAST VFIN SVO
            HAVE PCP2 SVO
shown       SHOW PCP2 SVOO SVO SV
that        ADV
            PRON DEM SG
            DET CENTRAL DEM SG
            CS
salivation  N NOM SG


Slide 26

Stage 2 of ENGTWOL Tagging

Second stage: apply constraints. Constraints are used in a negative way, to eliminate candidate readings.

Example: the adverbial “that” rule

Given input: “that”
If
  (+1 A/ADV/QUANT)   ; the next word is an adjective, adverb, or quantifier
  (+2 SENT-LIM)      ; and the word after that is a sentence boundary
  (NOT -1 SVOC/A)    ; and the previous word is not a verb like “consider”
Then eliminate non-ADV tags
Else eliminate the ADV tag

An example constraint, written out for clarity.
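In Python, applying such a negative constraint might look like the sketch below. The data structures are hypothetical simplifications: real ENGTWOL constraints operate on full morphological readings, not bare tag labels.

```python
# Sketch of ENGTWOL-style negative constraint application (hypothetical
# data structures; real ENGTWOL rules operate on rich morphological readings).
def adverbial_that_rule(readings, i, tags):
    """Disambiguate 'that' at position i given the tag sets of its neighbors.

    If the next word can be an adjective, adverb, or quantifier, the word
    after that is a sentence boundary, and the previous word is not a verb
    like 'consider' (SVOC/A subcategorization), keep only the ADV reading;
    otherwise eliminate the ADV reading.
    """
    nxt = tags[i + 1] if i + 1 < len(tags) else set()
    nxt2 = tags[i + 2] if i + 2 < len(tags) else {"SENT-LIM"}
    prev = tags[i - 1] if i > 0 else set()
    if (nxt & {"A", "ADV", "QUANT"}) and "SENT-LIM" in nxt2 and "SVOC/A" not in prev:
        return [r for r in readings if r == "ADV"]
    return [r for r in readings if r != "ADV"]

# "it isn't that odd ." -- 'that' modifies a following adjective at the
# end of the sentence, so only the adverbial reading survives
tags = [{"PRON"}, {"V"}, {"ADV", "DET", "PRON", "CS"}, {"A"}, {"SENT-LIM"}]
print(adverbial_that_rule(["ADV", "DET", "PRON", "CS"], 2, tags))  # ['ADV']
```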


Slide 27

Outline

• Why part of speech tagging?
• Word classes
• Tag sets and problem definition
• Automatic approaches 1: rule-based tagging
• Automatic approaches 2: stochastic tagging
• Automatic approaches 3: transformation-based tagging
• Other issues: tagging unknown words, evaluation


Slide 28

Stochastic Tagging

• Based on the probability of a tag occurring, given various possibilities
• Requires a training corpus
• Provides no probabilities for words not in the corpus
• The training corpus may be different from the test corpus


Slide 29

Stochastic Tagging (cont.)

• Simple method: choose the most frequent tag in the training text for each word!
– Result: about 90% accuracy
– This serves as a baseline; other methods do better
– The HMM tagger is one example
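This most-frequent-tag baseline can be sketched in a few lines; the training pairs below are invented toy data for illustration:

```python
from collections import Counter, defaultdict

# Most-frequent-tag baseline: for each word, always emit the tag it
# received most often in the training corpus (toy data for illustration).
train = [("the", "DT"), ("race", "NN"), ("the", "DT"),
         ("race", "NN"), ("race", "VB"), ("can", "MD")]

counts = defaultdict(Counter)
for word, tag in train:
    counts[word][tag] += 1

def baseline_tag(word, default="NN"):
    # Unknown words fall back to the most common open class: noun
    return counts[word].most_common(1)[0][0] if word in counts else default

print([baseline_tag(w) for w in ["the", "race", "horse"]])
# ['DT', 'NN', 'NN']
```

Note that “race” is tagged NN regardless of context, which is exactly the kind of error the context-sensitive taggers below try to fix.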


Slide 30

HMM Tagger

• Intuition: pick the most likely tag for each word.
• HMM taggers choose the tag sequence that maximizes:
  P(word|tag) × P(tag|previous n tags)
• Let T = t1, t2, …, tn and W = w1, w2, …, wn
• Find the POS tags that generate the sequence of words, i.e., look for the most probable sequence of tags T underlying the observed words W.


Slide 31

Start with Bigram-HMM Tagger

argmaxT P(T|W)
= argmaxT P(T) P(W|T)            (Bayes’ rule; P(W) is constant)
= argmax P(t1…tn) P(w1…wn|t1…tn)
≈ argmax [P(t1) P(t2|t1) … P(tn|tn-1)] [P(w1|t1) P(w2|t2) … P(wn|tn)]

To tag a single word: ti = argmaxj P(tj|ti-1) P(wi|tj)

How do we compute P(ti|ti-1)?  c(ti-1, ti) / c(ti-1)
How do we compute P(wi|ti)?  c(wi, ti) / c(ti)
How do we compute the most probable tag sequence?  With the Viterbi algorithm.
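The Viterbi search over tag sequences can be sketched as follows. The transition and emission tables below are made-up toy values, not Brown-corpus estimates, and unseen events get a tiny floor probability instead of proper smoothing:

```python
import math

# Minimal Viterbi decoder for a bigram HMM tagger: maximizes
# prod_i P(t_i | t_{i-1}) * P(w_i | t_i) over tag sequences.
tags = ["DT", "NN", "VB"]
trans = {("<s>", "DT"): 0.8, ("<s>", "NN"): 0.1, ("<s>", "VB"): 0.1,
         ("DT", "NN"): 0.9, ("DT", "VB"): 0.05, ("DT", "DT"): 0.05,
         ("NN", "VB"): 0.5, ("NN", "NN"): 0.3, ("NN", "DT"): 0.2,
         ("VB", "DT"): 0.6, ("VB", "NN"): 0.3, ("VB", "VB"): 0.1}
emit = {("DT", "the"): 0.7, ("NN", "dog"): 0.4, ("VB", "dog"): 0.01,
        ("NN", "barks"): 0.05, ("VB", "barks"): 0.3}

def viterbi(words):
    # best[t] = (log-prob of best path ending in tag t, that path)
    best = {t: (math.log(trans.get(("<s>", t), 1e-12))
                + math.log(emit.get((t, words[0]), 1e-12)), [t]) for t in tags}
    for w in words[1:]:
        best = {t: max(((p + math.log(trans.get((prev, t), 1e-12))
                         + math.log(emit.get((t, w), 1e-12)), path + [t])
                        for prev, (p, path) in best.items()), key=lambda x: x[0])
                for t in tags}
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi(["the", "dog", "barks"]))  # ['DT', 'NN', 'VB']
```

Working in log space avoids underflow from multiplying many small probabilities; the dynamic program keeps only the best path into each tag at each position.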


Slide 32

Markov Model Taggers

Bigram tagger:
– makes predictions based on the preceding tag
– the basic unit is the preceding tag plus the current tag

Trigram tagger:
– we would expect more accurate predictions if more context is taken into account
– e.g., should “clearly marked” be RB (adverb) + VBD (past tense) or RB + VBN (past participle)?
  “is clearly marked”: P(BEZ RB VBN) > P(BEZ RB VBD)
  “he clearly marked”: P(PN RB VBD) > P(PN RB VBN)


Slide 33

An Example

Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN

People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN

to/TO race/???
the/DT race/???

ti = argmaxj P(tj|ti-1) P(wi|tj)
max[ P(VB|TO) P(race|VB), P(NN|TO) P(race|NN) ]

Brown corpus estimates:
P(NN|TO) = .021, P(race|NN) = .00041 → product = .000007
P(VB|TO) = .34, P(race|VB) = .00003 → product = .00001
So “race” after “to” is tagged VB.
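The product comparison can be reproduced directly from the rounded estimates on the slide:

```python
# Comparing the two hypotheses for "to/TO race/???" (Brown-corpus estimates)
p_nn = 0.021 * 0.00041   # P(NN|TO) * P(race|NN)
p_vb = 0.34 * 0.00003    # P(VB|TO) * P(race|VB)

print(p_nn, p_vb)        # ~8.6e-06 vs ~1.0e-05
print("race ->", "VB" if p_vb > p_nn else "NN")  # race -> VB
```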


Slide 34

An Early Approach to Statistical POS Tagging

• PARTS tagger (Church, 1988): stores the probability of tag given word, instead of word given tag.
• P(tag|word) × P(tag|previous n tags)
• Compare to the HMM quantity:
– P(word|tag) × P(tag|previous n tags)


Slide 35

PARTS vs HMM

What is the main difference between PARTS tagger (Church) and the HMM tagger?

C(water) = 1,000
C(NN) = 5,000,000
C(VB) = 1,000,000
C(water, NN) = 700
C(water, VB) = 300
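One way to see the difference: with the counts above, the two parameterizations can prefer different tags for “water”, because P(word|tag) is divided by the overall tag frequency:

```python
# Contrasting the PARTS and HMM parameterizations on the counts above
c_water, c_nn, c_vb = 1000, 5_000_000, 1_000_000
c_water_nn, c_water_vb = 700, 300

# PARTS (Church, 1988): P(tag | word)
p_nn_given_water = c_water_nn / c_water   # 0.7
p_vb_given_water = c_water_vb / c_water   # 0.3

# HMM: P(word | tag)
p_water_given_nn = c_water_nn / c_nn      # 0.00014
p_water_given_vb = c_water_vb / c_vb      # 0.0003

# The direction of preference flips: P(tag|word) favors NN, while
# P(word|tag) favors VB, because VB is a much rarer tag overall.
print(p_nn_given_water > p_vb_given_water)  # True
print(p_water_given_nn > p_water_given_vb)  # False
```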


Slide 36

Outline

• Why part of speech tagging?
• Word classes
• Tag sets and problem definition
• Automatic approaches 1: rule-based tagging
• Automatic approaches 2: stochastic tagging
• Automatic approaches 3: transformation-based tagging
• Other issues: tagging unknown words, evaluation


Slide 37

Transformation-Based Tagging (Brill Tagging)

• Combines rule-based and stochastic tagging methodologies
– like rule-based tagging, because rules are used to specify tags in a certain environment
– like the stochastic approach, because machine learning is used, with a tagged corpus as input
• Input:
– a tagged corpus
– a dictionary (with the most frequent tags), usually constructed from the tagged corpus


Slide 38

Transformation-Based Tagging (cont.)

• Basic idea:
– set the most probable tag for each word as a start value
– change tags according to rules of the type “if word-1 is a determiner and word is a verb, then change the tag to noun”, applied in a specific order
• Training is done on a tagged corpus:
1. Write a set of rule templates
2. Among the instantiated rules, find the one with the highest score
3. Repeat step 2 until the best score falls below a threshold
4. Keep the ordered list of rules
• Earlier rules may make errors that are corrected by later rules


Slide 39

TBL Rule Application

The tagger first labels every word with its most likely tag. For example, “race” has the following probabilities in the Brown corpus:
P(NN|race) = .98
P(VB|race) = .02

Transformation rules then make changes to tags:
“Change NN to VB when the previous tag is TO”

… is/VBZ expected/VBN to/TO race/NN tomorrow/NN
becomes
… is/VBZ expected/VBN to/TO race/VB tomorrow/NN
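Applying one such transformation can be sketched as below; the helper `apply_rule` is a hypothetical simplification of Brill's rule machinery:

```python
# Sketch: apply one Brill transformation to an initial most-likely tagging
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    """Change from_tag to to_tag whenever the previous token's tag is prev_tag."""
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (word, to_tag)
    return out

sent = [("is", "VBZ"), ("expected", "VBN"), ("to", "TO"),
        ("race", "NN"), ("tomorrow", "NN")]
print(apply_rule(sent, "NN", "VB", "TO"))
# "race" is retagged VB; "tomorrow" keeps NN since its predecessor's tag is not TO
```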


Slide 40

TBL: Rule Learning

2 parts to a ruleTriggering environmentRewrite rule

The range of triggering environments of templates (from Manning & Schutze 1999:363)

Schema ti-3 ti-2 ti-1 ti ti+1 ti+2 ti+3

1 *2 *3 *4 *5 *6 *7 *8 *9 *


Slide 41

TBL: The Algorithm

• Step 1: label every word with its most likely tag (from the dictionary)
• Step 2: check every possible transformation and select the one that most improves the tagging
• Step 3: re-tag the corpus by applying the selected rule
• Repeat steps 2-3 until some criterion is reached, e.g., X% correct with respect to the training corpus
• Result: an ordered sequence of transformation rules
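Steps 1-3 can be sketched for a single rule template, “change tag A to B when the preceding tag is C”. The function names and the one-sentence corpus are invented for illustration; real TBL scores rules over an entire corpus:

```python
# Greedy TBL learning sketch for one template:
# "change A to B when the previous tag is C"
def score(rule, current, gold):
    """Net number of corrections the rule would make on the corpus."""
    from_tag, to_tag, prev_tag = rule
    gain = 0
    for i in range(1, len(current)):
        if current[i] == from_tag and current[i - 1] == prev_tag:
            gain += (to_tag == gold[i]) - (from_tag == gold[i])
    return gain

def learn_one_rule(current, gold):
    """Instantiate the template with every tag triple and keep the best."""
    tagset = set(gold) | set(current)
    candidates = [(a, b, c) for a in tagset for b in tagset for c in tagset if a != b]
    return max(candidates, key=lambda r: score(r, current, gold))

current = ["VBZ", "VBN", "TO", "NN", "NN"]   # most-likely-tag start state
gold    = ["VBZ", "VBN", "TO", "VB", "NN"]   # hand-tagged reference
print(learn_one_rule(current, gold))  # ('NN', 'VB', 'TO')
```

The learned rule is exactly the “change NN to VB after TO” transformation from the race example; after applying it, the loop would search for the next-best rule on the re-tagged corpus.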


Slide 42

TBL: Rule Learning (cont’d)

• Problem: transformations could be applied ad infinitum!
• Constrain the set of transformations with “templates”:
– replace tag X with tag Y, provided tag Z or word Z′ appears in some position
• Rules are learned in an ordered sequence
• Rules may interact
• Rules are compact and can be inspected by humans


Slide 43

Templates for TBL


Slide 44

TBL: Problems

• Execution speed: a TBL tagger is slower than the HMM approach
– solution: compile the rules into a finite-state transducer (FST)
• Learning speed: Brill’s implementation took over a day on 600k tokens


Slide 45

Outline

• Why part of speech tagging?
• Word classes
• Tag sets and problem definition
• Automatic approaches 1: rule-based tagging
• Automatic approaches 2: stochastic tagging
• Automatic approaches 3: transformation-based tagging
• Other issues: tagging unknown words, evaluation


Slide 46

Tagging Unknown Words

• New words are added to (newspaper) language at a rate of 20+ per month
• Plus many proper names …
• Unknown words increase error rates by 1-2%

• Method 1: assume they are nouns
• Method 2: assume unknown words have a probability distribution similar to words occurring only once in the training set
• Method 3: use morphological information, e.g., words ending in -ed tend to be tagged VBN
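Method 3 might look like the sketch below; the suffix rules are a small illustrative set (real taggers learn suffix statistics from the training data), and the test words are made up:

```python
# Sketch of Method 3: back off to simple morphological cues for unknown words
def guess_tag(word):
    if word[0].isupper():
        return "NNP"   # capitalized: likely a proper noun
    if word.endswith("ed"):
        return "VBN"   # past participle / past tense
    if word.endswith("ing"):
        return "VBG"   # gerund / present participle
    if word.endswith("ly"):
        return "RB"    # adverb
    if word.endswith("s"):
        return "NNS"   # plural noun
    return "NN"        # Method 1 fallback: assume noun

print([guess_tag(w) for w in ["Blorptown", "snarfed", "glarping", "wugs"]])
# ['NNP', 'VBN', 'VBG', 'NNS']
```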


Slide 47

Evaluation

• The result is compared with a manually annotated “gold standard”
– accuracy typically reaches 96-97%
– this may be compared with the result for a baseline tagger (one that uses no context)
– important: 100% is impossible even for human annotators

• Factors that affect performance:
– the amount of training data available
– the tag set
– the difference between training corpus and test corpus
– the dictionary
– unknown words
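Accuracy against the gold standard is simply token-level agreement:

```python
# Tagging accuracy: fraction of tokens whose predicted tag matches the gold tag
def accuracy(predicted, gold):
    assert len(predicted) == len(gold)
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

pred = ["DT", "NN", "VB", "NN", "IN"]
gold = ["DT", "NN", "VB", "NNS", "IN"]
print(f"{accuracy(pred, gold):.0%}")  # 80%
```

In practice this is reported on a held-out test set, and compared against both the no-context baseline and the inter-annotator agreement ceiling.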