POS tagging
Intro to NLP - ETHZ - 17/03/2014
Summary
● Parts of speech
● Tagsets
● Part of speech tagging
● HMM Tagging:
○ Most likely tag sequence
○ Probability of an observation
○ Parameter estimation
○ Evaluation
POS ambiguity
"Squad helps dog bite victim"
● bite -> verb?
● bite -> noun?
Parts of Speech (PoS)
Traditional parts of speech:
● Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
● Called: parts-of-speech, lexical categories, word classes, morphological classes, lexical tags...
Examples
N (noun): car, squad, dog, bite, victim
V (verb): help, bite
ADJ (adjective): purple, tall
ADV (adverb): unfortunately, slowly
P (preposition): of, by, to
PRO (pronoun): I, me, mine
DET (determiner): the, a, that, those
Open and closed classes
1. Closed class: small, stable set
a. Auxiliaries: may, can, will, been, ...
b. Prepositions: of, in, by, ...
c. Pronouns: I, you, she, mine, his, them, ...
d. Usually function words: short, with a grammatical role
2. Open class:
a. New ones are created all the time ("to google/tweet", "e-quaintance", "captcha", "cloud computing", "netbook", "webinar", "widget")
b. English has 4: nouns, verbs, adjectives, adverbs
c. Many languages have these 4, but not all!
Open class words
1. Nouns:
a. Proper nouns: Zurich, IBM, Albert Einstein, The Godfather, ... Capitalized in many languages.
b. Common nouns: the rest; also capitalized in German; mass/count nouns (goat/goats, snow/*snows)
2. Verbs:
a. Morphological affixes in English: eat/eats/eaten
3. Adverbs: tend to modify things:
a. John walked home extremely slowly yesterday
b. Directional/locative adverbs (here, home, downhill)
c. Degree adverbs (extremely, very, somewhat)
d. Manner adverbs (slowly, slinkily, delicately)
4. Adjectives: qualify nouns and noun phrases
Closed class words
1. Prepositions: on, under, over, ...
2. Particles: up, down, on, off, ...
3. Determiners: a, an, the, ...
4. Pronouns: she, who, I, ...
5. Conjunctions: and, but, or, ...
6. Auxiliary verbs: can, may, should, ...
7. Numerals: one, two, three, third, ...
Prepositions with corpus frequencies
Conjunctions
Pronouns
Auxiliaries
Applications
● A useful pre-processing step in many tasks
● Syntactic parsing: important source of information for syntactic analysis
● Machine translation
● Information retrieval: stemming, filtering
● Named entity recognition
● Summarization?
Applications
Speech synthesis, for correct pronunciation of "ambiguous" words:
○ lead /lid/ (guide) vs. /led/ (chemical)
○ inSULT vs. INsult
○ obJECT vs. OBject
○ overFLOW vs. OVERflow
○ conTENT vs. CONtent
Summarization
● Idea: Filter out sentences starting with certain PoS tags
● Use PoS statistics from gold standard titles (might need cross-validation)
Summarization
● Idea: Filter out sentences starting with certain PoS tags:
● Title1: "Apple introduced Siri, an intelligent personal assistant to which you can ask questions"
● Title2: "Especially now that a popular new feature from Apple is bound to make other phones users envious: voice control with Siri"
● Title3: "But Siri, Apple's personal assistant application on the iPhone 4s, doesn't disappoint"
PoS tagging
● The process of assigning a part-of-speech tag (label) to each word in a text.
● Pre-processing: tokenization

Word Tag
Squad N
helps V
dog N
bite N
victim N
Choosing a tagset
1. There are many parts of speech and potential distinctions we can draw
2. For POS tagging, we need to choose a standard set of tags to work with
3. Coarse tagsets: N, V, Adj, Adv
a. A universal PoS tagset? http://en.wikipedia.org/wiki/Part-of-speech_tagging
4. The more commonly used sets are finer grained, e.g. the "Penn TreeBank tagset", 45 tags
PTB tagset
Examples*
1. I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NN
2. Does/VBZ this/DT flight/NN serve/VB dinner/NNS
3. I/PRP have/VB a/DT friend/NN living/VBG in/IN Denver/NNP
4. Can/VBP you/PRP list/VB the/DT nonstop/JJ afternoon/NN flights/NNS

Examples* (corrected one tag per slide: Atlanta/NNP, dinner/NN, have/VBP, Can/MD)
1. I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NNP
2. Does/VBZ this/DT flight/NN serve/VB dinner/NN
3. I/PRP have/VBP a/DT friend/NN living/VBG in/IN Denver/NNP
4. Can/MD you/PRP list/VB the/DT nonstop/JJ afternoon/NN flights/NNS
Complexities
● Book/VB that/DT flight/NN ./.
● There/EX are/VBP 70/CD children/NNS there/RB ./.
● Mrs./NNP Shaefer/NNP never/RB got/VBD around/RP to/TO joining/VBG ./.
● All/DT we/PRP gotta/VBN do/VB is/VBZ go/VB around/IN the/DT corner/NN ./.
● Unresolvable ambiguity:
○ The Duchess was entertaining last night .
Word PoS (WSJ) PoS (Universal)
The DT DET
oboist NN NOUN
Heinz NNP NOUN
Holliger NNP NOUN
has VBZ VERB
taken VBN VERB
a DT DET
hard JJ ADJ
line NN NOUN
about IN ADP
the DT DET
problems NNS NOUN
. . .
POS Tagging
● Words often have more than one POS: back
○ The back door = JJ
○ On my back = NN
○ Win the voters back = RB
○ Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word.
Word type tag ambiguity
Methods
1. Rule-based:
a. Start with a dictionary
b. Assign all possible tags
c. Write rules by hand to remove tags in context
2. Stochastic:
a. Supervised/unsupervised
b. Generative/discriminative
c. Independent/structured output
d. HMMs
Rule-based tagging
1. Start with a dictionary:
she PRP
promised VBN, VBD
to TO
back VB, JJ, RB, NN
the DT
bill NN, VB
Rule-based tagging
2. Assign all possible tags:
she → PRP
promised → VBN, VBD
to → TO
back → VB, JJ, RB, NN
the → DT
bill → NN, VB
Rule-based tagging
3. Introduce rules to reduce ambiguity:
Rule: "<start> PRP {VBN, VBD}" -> "<start> PRP VBD"
After the rule, only VBD remains for "promised":
she → PRP
promised → VBD
to → TO
back → VB, JJ, RB, NN
the → DT
bill → NN, VB
Rule-based tagging
3. Introduce rules to reduce ambiguity:
Rule: "TO {VB, NN, JJ, RB} DT" -> "TO VB DT"
After the rule, only VB remains for "back" (see the sketch below):
she → PRP
promised → VBD
to → TO
back → VB
the → DT
bill → NN, VB
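To make the three steps concrete, here is a minimal Python sketch of dictionary-plus-rules tagging on this sentence. The dictionary contents come from the slides; the rule encoding and function names are assumptions of this illustration, not the lecture's code.

```python
# Step 1: the dictionary from the slides (word -> possible tags).
DICTIONARY = {
    "she": {"PRP"},
    "promised": {"VBN", "VBD"},
    "to": {"TO"},
    "back": {"VB", "JJ", "RB", "NN"},
    "the": {"DT"},
    "bill": {"NN", "VB"},
}

def assign_all_tags(words):
    """Step 2: attach every dictionary tag to each token."""
    return [set(DICTIONARY[w]) for w in words]

def apply_rules(tags):
    """Step 3: hand-written rules that prune tags in context."""
    # Rule: "<start> PRP {VBN, VBD}" -> "<start> PRP VBD"
    if len(tags) > 1 and tags[0] == {"PRP"} and tags[1] == {"VBN", "VBD"}:
        tags[1] = {"VBD"}
    # Rule: "TO {VB, NN, JJ, RB} DT" -> "TO VB DT"
    for i in range(1, len(tags) - 1):
        if tags[i - 1] == {"TO"} and "VB" in tags[i] and tags[i + 1] == {"DT"}:
            tags[i] = {"VB"}
    return tags

words = "she promised to back the bill".split()
print(apply_rules(assign_all_tags(words)))
# [{'PRP'}, {'VBD'}, {'TO'}, {'VB'}, {'DT'}, {'NN', 'VB'}] (set order may vary)
```

Note that "bill" remains ambiguous between NN and VB: the two rules shown do not cover it.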
Statistical models for POS tagging
1. A classic: one of the first successful applications of statistical methods in NLP
2. Extensively studied with all possible approaches (a benchmark for sequence models)
3. Simple to get started on: data, evaluation, literature
4. An introduction to more complex segmentation and labelling tasks: NER, shallow parsing, global optimization
5. An introduction to HMMs, used in many variants in POS tagging and related tasks.
Supervision and resources
1. Supervised case: data with words manually annotated with POS tags
2. Partially supervised: annotated data + un-annotated data
3. Unsupervised: only raw text available
4. Resources: dictionaries with words' possible tags
5. Start with the supervised task
HMMs
HMM = (Q, O, A, B)
1. States: Q = q_1 .. q_N [the part-of-speech tags]
2. Observation symbols: O = o_1 .. o_V [words]
3. Transitions:
a. A = {a_ij}, where $a_{ij} = P(t_s = q_j \mid t_{s-1} = q_i)$
b. $t_s \mid t_{s-1} = q_i \sim \mathrm{Multi}(a_i)$
c. Special vector of initial/final probabilities
4. Emissions:
a. B = {b_ik}, where $b_{ik} = P(w_s = o_k \mid t_s = q_i)$
b. $w_s \mid t_s = q_i \sim \mathrm{Multi}(b_i)$
Markov Chain Interpretation
Tagging process as a (hidden) Markov process
● Independence assumptions:
1. Limited horizon: $P(t_{s+1} \mid t_{1:s}) = P(t_{s+1} \mid t_s)$
2. Time-invariant: the transition probabilities do not depend on the position s
3. Observation depends only on the state: $P(w_s \mid w_{1:s-1}, t_{1:s}) = P(w_s \mid t_s)$
Complete data likelihood
The joint probability of a sequence of words and tags, given a model:
$P(w_{1:N}, t_{1:N} \mid \theta) = \prod_{s=1}^{N} P(t_s \mid t_{s-1})\, P(w_s \mid t_s)$
Generative process:
1. generate a tag sequence
2. emit the words for each tag
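As a concrete illustration, here is a minimal Python sketch that scores one (words, tags) pair under this factorization, using the toy transition and emission tables of the "board backs plan" example later in these slides. The dictionary encoding of A and B is an assumption of the sketch.

```python
# Toy model from the "board backs plan" example below.
A = {  # transitions: A[previous_tag][next_tag]
    "START": {"N": 0.6, "V": 0.4},
    "N": {"N": 0.3, "V": 0.7, "END": 0.7},
    "V": {"N": 0.8, "V": 0.2, "END": 0.3},
}
B = {  # emissions: B[tag][word]
    "N": {"board": 0.4, "backs": 0.2, "plan": 0.4},
    "V": {"board": 0.3, "backs": 0.3, "plan": 0.4},
}

def joint_probability(words, tags):
    """Product of transition and emission terms, START to END."""
    p, prev = 1.0, "START"
    for w, t in zip(words, tags):
        p *= A[prev][t] * B[t][w]  # generate the tag, then emit the word
        prev = t
    return p * A[prev]["END"]      # close with the final transition

print(joint_probability(["board", "backs", "plan"], ["N", "V", "N"]))
# 0.0112896 -- the delta = .011 value of the Viterbi trace below
```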
Inference in HMMs
Three fundamental problems:
1. Given an observation (e.g., a sentence), find the most likely sequence of states (e.g., POS tags)
2. Given an observation, compute its probability
3. Given a dataset of observations (sequences), estimate the model's parameters: $\theta = (A, B)$
HMMs and FSA
(also Bayes nets, directed graphical models, etc.)
Other applications of HMMs
● NLP: named entity recognition, shallow parsing, word segmentation, optical character recognition
● Speech recognition
● Computer vision: image segmentation
● Biology: protein structure prediction
● Economics, climatology, robotics, ...
POS as sequence classification
● Observation: a sequence of N words w_{1:N}
● Response: a sequence of N tags t_{1:N}
● Task: find the predicted t'_{1:N} such that:
$t'_{1:N} = \arg\max_{t_{1:N}} P(t_{1:N} \mid w_{1:N})$
The best possible tagging for the sequence.
Bayes rule reformulation
$t'_{1:N} = \arg\max_{t_{1:N}} \frac{P(w_{1:N} \mid t_{1:N})\, P(t_{1:N})}{P(w_{1:N})} = \arg\max_{t_{1:N}} P(w_{1:N} \mid t_{1:N})\, P(t_{1:N})$
(the denominator does not depend on the tag sequence)
HMM POS tagging
● How can we find t'_{1:N}?
● Enumeration of all possible sequences?
○ O(|Tagset|^N) !
● Dynamic programming: the Viterbi algorithm
Viterbi algorithm
$\delta_s(t) = \max_{s'} \left[ \delta_{s'}(t-1)\, a_{s's} \right] b_s(w_t)$, keeping a backpointer to the maximizing s'.
Example: model

A+ =
       N    V    END
V      0.8  0.2  0.3
N      0.3  0.7  0.7
START  0.6  0.4

B =
   board  backs  plan
V  0.3    0.3    0.4
N  0.4    0.2    0.4
Example: observation
● Sentence: "Board backs plan"
● Find the most likely tag sequence
● $\delta_s(t)$ = probability of the most likely path ending at state s at time t
Viterbi: forward pass

Trellis over states {N, V} for "board backs plan" (δ values from the slides):
Time 1 (board): δ_N = .24, δ_V = .12
Time 2 (backs): δ_N = .019, δ_V = .050
Time 3 (plan): δ_N = .016, δ_V = .005
END: δ = .011

Viterbi: backtrack
Following the backpointers from END (δ = .011): N (time 3) ← V (time 2) ← N (time 1)

Viterbi: output
board/N backs/V plan/N (δ = .011)
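The trace above can be reproduced with a few lines of Python. This is a sketch only, reusing the A and B dictionaries from the complete-data-likelihood example earlier, not the lecture's own implementation:

```python
STATES = ["N", "V"]  # A and B as defined in the earlier likelihood sketch

def viterbi(words):
    # delta[t][s]: probability of the best path ending in state s at time t
    delta = [{s: A["START"][s] * B[s][words[0]] for s in STATES}]
    backptr = [{}]
    for t in range(1, len(words)):
        delta.append({})
        backptr.append({})
        for s in STATES:
            prev = max(STATES, key=lambda sp: delta[t - 1][sp] * A[sp][s])
            delta[t][s] = delta[t - 1][prev] * A[prev][s] * B[s][words[t]]
            backptr[t][s] = prev
    # best final transition into END, then follow the backpointers
    last = max(STATES, key=lambda s: delta[-1][s] * A[s]["END"])
    tags = [last]
    for t in range(len(words) - 1, 0, -1):
        tags.append(backptr[t][tags[-1]])
    return list(reversed(tags)), delta[-1][last] * A[last]["END"]

print(viterbi(["board", "backs", "plan"]))
# (['N', 'V', 'N'], 0.0112896) -- matching the trellis: .24/.12, .019/.050, .016/.005, .011
```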
Observation probability
● Given an HMM θ = (A, B) and an observation sequence w_{1:N}, compute P(w_{1:N} | θ)
● Applications: language modeling
● Complete data likelihood:
$P(w_{1:N}, t_{1:N} \mid \theta) = \prod_{s=1}^{N} P(t_s \mid t_{s-1})\, P(w_s \mid t_s)$
● Sum over all possible tag sequences:
$P(w_{1:N} \mid \theta) = \sum_{t_{1:N}} P(w_{1:N}, t_{1:N} \mid \theta)$
Forward algorithm
● Dynamic programming: each state of the trellis stores a value $\alpha_i(s)$ = probability of being in state s having observed $w_{1:i}$
● Sum over all paths up to i-1 leading to s:
$\alpha_i(s) = b_s(w_i) \sum_{s'} \alpha_{i-1}(s')\, a_{s's}$
● Init: $\alpha_1(s) = a_{\mathrm{START},s}\, b_s(w_1)$
Forward computation

Trellis over states {N, V} for "board backs plan" (α values from the slides):
Time 1 (board): α_N = .24, α_V = .12
Time 2 (backs): α_N = .034, α_V = .058
Time 3 (plan): α_N = .022, α_V = .014
END: α ≈ .02, i.e. P("board backs plan" | θ) ≈ 0.02
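The same trellis computed with sums instead of maxima; again a sketch, reusing A, B, and STATES from the Viterbi sketch above:

```python
def forward(words):
    # alpha[i][s]: probability of emitting words[:i+1] and being in state s
    alpha = [{s: A["START"][s] * B[s][words[0]] for s in STATES}]
    for i in range(1, len(words)):
        alpha.append({
            s: B[s][words[i]] * sum(alpha[i - 1][sp] * A[sp][s] for sp in STATES)
            for s in STATES
        })
    # sum over the final transitions into END
    return sum(alpha[-1][s] * A[s]["END"] for s in STATES)

print(forward(["board", "backs", "plan"]))
# 0.0199296 -- P(w | theta) ≈ .02, matching the trellis values
```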
Parameter estimation
Maximum likelihood estimates (MLE) on annotated data:
1. Transition probabilities:
$\hat{P}(t_j \mid t_i) = \frac{C(t_i, t_j)}{C(t_i)}$
2. Emission probabilities:
$\hat{P}(w_k \mid t_i) = \frac{C(t_i, w_k)}{C(t_i)}$
where C(·) are counts from the training corpus.
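A minimal sketch of these relative-frequency estimates; the input format (a list of sentences, each a list of (word, tag) pairs) and the function name are assumptions of this illustration:

```python
from collections import Counter

def mle_estimate(tagged_sentences):
    """Count transitions and emissions, then divide by tag counts."""
    trans, emit, count = Counter(), Counter(), Counter()
    for sentence in tagged_sentences:
        prev = "START"
        for word, tag in sentence:
            trans[(prev, tag)] += 1
            emit[(tag, word)] += 1
            count[prev] += 1
            prev = tag
        trans[(prev, "END")] += 1  # close the sentence
        count[prev] += 1
    A = {(ti, tj): c / count[ti] for (ti, tj), c in trans.items()}
    B = {(t, w): c / count[t] for (t, w), c in emit.items()}
    return A, B

A_hat, B_hat = mle_estimate([[("board", "N"), ("backs", "V"), ("plan", "N")]])
print(A_hat[("START", "N")], B_hat[("N", "board")])  # 1.0 0.5
```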
Implementation details
1. Start/end states
2. Log space/rescaling (see the sketch below)
3. Vocabularies: model pruning
4. Higher-order models:
a. State representation
b. Estimation and sparsity: deleted interpolation
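On point 2: multiplying many probabilities underflows floating point on long sequences, so Viterbi and forward are usually run in log space, where products become sums. An illustrative fragment, using the seven factors of the toy joint probability above:

```python
import math

factors = [0.6, 0.4, 0.7, 0.3, 0.8, 0.4, 0.7]   # transition/emission terms for N, V, N
log_p = sum(math.log(p) for p in factors)        # safe even for thousands of factors
print(math.exp(log_p))                           # 0.0112896, same value as the direct product
```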
Evaluation
● So once you have your POS tagger running, how do you evaluate it?
1. Overall error rate with respect to a gold-standard test set (a trivial code version follows):
a. ER = # words incorrectly tagged / # words tagged
2. Error rates on particular tags (and pairs)
3. Error rates on particular words (especially unknown words)
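The overall error rate of item 1 in code form; a minimal sketch assuming aligned lists of predicted and gold tags:

```python
def error_rate(predicted, gold):
    """ER = # words incorrectly tagged / # words tagged."""
    return sum(p != g for p, g in zip(predicted, gold)) / len(gold)

print(error_rate(["N", "V", "N"], ["N", "V", "V"]))  # 0.333...
```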
Evaluation
● The result is compared with a manually coded "Gold Standard"
● Typically accuracy > 97% on WSJ PTB
● This may be compared with results for a baseline tagger (one that uses no context).
○ Baselines (most frequent tag) can achieve up to 90% accuracy.
● Important: 100% is impossible even for human annotators.
Summary
● Parts of speech
● Tagsets
● Part of speech tagging
● HMM Tagging:
○ Most likely tag sequence (decoding)
○ Probability of an observation (word sequence)
○ Parameter estimation (supervised)
○ Evaluation
Next class
● Unsupervised POS tagging models (HMMs)
● Parameter estimation: the forward-backward algorithm
● Discriminative sequence models: MaxEnt, CRF, Perceptron, SVM, etc.
● Read J&M 5-6
● Pre-process and POS tag the data: report problems & baselines