COMP90042 WSTA Lecture 15: Tagging with HMMs (Trevor Cohn)


Page 1: WSTA Lecture 15, Tagging with HMMs

• Tagging sequences
 - modelling concepts
 - Markov chains and hidden Markov models (HMMs)
 - decoding: finding best tag sequence

• Applications
 - POS tagging
 - entity tagging
 - shallow parsing

• Extensions
 - unsupervised learning
 - supervised learning of Conditional Random Field models

Page 2: Markov Chains

• Useful trick to decompose complex chain of events

- into simpler, smaller modellable events

• Already seen Markov chains used:

- for link analysis

- modelling page visits by random surfer

- Pr(visiting page) only dependent on last page visited

- for language modelling:

- Pr(next word) dependent on last (n-1) words

- for tagging

- Pr(tag) dependent on word and last (n-1) tags

Page 3: Hidden tags in POS tagging

• In sequence labelling we don’t know the last tag…

- get sequence of M words, need to output M tags (classes)

- the tag sequence is hidden and must be inferred

- nb. may have tags for training set, but not in general (testing set)

Page 4: Predicting sequences

• Have to produce sequence of M tags

• Could treat this as classification…

- e.g., one big ‘label’ encoding the full sequence

- but there are exponentially many combinations, |Tags|^M

- how to tag sequences of differing lengths?

• A solution: learning a local classifier

- e.g., Pr(t_n | w_n, t_{n-2}, t_{n-1}) or P(w_n, t_n | t_{n-1})

- still have problem of finding best tag sequence for w

- can we avoid the exponential complexity?

Page 5: Markov Models

• Characterised by
 - a set of states
 - initial state occupancy probabilities
 - state transition probabilities (outgoing edges normalised)

• Can score sequences of observations
 - for the stock price example, up-up-down-up-up maps directly to states
 - simply multiply the probabilities along the path (see the sketch below)

Fig. from Spoken language processing; Huang, Acero, Hon (2001); Prentice Hall
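To make the "simply multiply the probabilities" point concrete, here is a minimal Python sketch scoring up-up-down-up-up under a plain (visible) Markov chain. The initial and transition probabilities match those appearing in the worked table on the "Probability of sequence" slide; the entries not shown there are assumptions filled in from the standard stock-market example in Huang, Acero & Hon (2001), the source of the figure.

import numpy as np

# in a visible Markov chain the observation IS the state
states = {"up": 0, "down": 1, "unchanged": 2}

pi = np.array([0.5, 0.2, 0.3])           # initial state probabilities
A = np.array([[0.6, 0.2, 0.2],           # transition probabilities A[from, to]
              [0.5, 0.3, 0.2],
              [0.4, 0.1, 0.5]])          # each row sums to 1 (outgoing edges normalised)

seq = ["up", "up", "down", "up", "up"]
idx = [states[s] for s in seq]

prob = pi[idx[0]]
for prev, cur in zip(idx, idx[1:]):
    prob *= A[prev, cur]
print(prob)   # 0.5 * 0.6 * 0.2 * 0.5 * 0.6 = 0.018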

Page 6: Hidden Markov Models

• Each state now additionally has
 - an emission probability vector

• No longer a 1:1 mapping from the observation sequence to states
 - e.g., up-up-down-up-up could be generated from any state sequence
 - but some are more likely than others!

• The state sequence is 'hidden'

Fig. from Spoken language processing; Huang, Acero, Hon (2001); Prentice Hall

Page 7: Notation

• Basic units are a sequence of

- O, observations e.g., words

- Ω, states e.g., POS tags

• Model characterised by

- initial state probs π = vector of |Ω| elements

- transition probs A = matrix of |Ω| x |Ω|

- emission probs O = matrix of |Ω| x |O|

• Together define the probability of a sequence

- of observations together with their tags

- a model of P(w, t)

• Notation: w = observations; t = tags; i = time index

Page 8: Assumptions

• Two assumptions underlying the HMM

• Markov assumption

- states independent of all but most recent state

- P(t_i | t_1, t_2, t_3, …, t_{i-2}, t_{i-1}) = P(t_i | t_{i-1})

- state sequence is a Markov chain

• Output independence

- outputs dependent only on matching state

- P(w_i | w_1, t_1, …, w_{i-1}, t_{i-1}, t_i) = P(w_i | t_i)

- forces the state t_i to carry all information linking w_i with its neighbours

• Are these assumptions realistic?

Page 9: Probability of sequence

• Probability of the sequence "up-up-down"

- 1,1,2 is the highest-probability hidden state sequence

- total probability is 0.054398, not 1 - why?? (a brute-force sketch follows the table below)

State seq    up: π × O       up: A × O       down: A × O        total
1,1,1        0.5 × 0.7    ×  0.6 × 0.7    ×  0.6 × 0.1    =   0.00882
1,1,2        0.5 × 0.7    ×  0.6 × 0.7    ×  0.2 × 0.6    =   0.01764
1,1,3        0.5 × 0.7    ×  0.6 × 0.7    ×  0.2 × 0.3    =   0.00882
1,2,1        0.5 × 0.7    ×  0.2 × 0.1    ×  0.5 × 0.1    =   0.00035
1,2,2        0.5 × 0.7    ×  0.2 × 0.1    ×  0.3 × 0.6    =   0.00126
…            (remaining rows omitted; there are 3^3 = 27 state sequences in total)
3,3,3        0.3 × 0.3    ×  0.5 × 0.3    ×  0.5 × 0.3    =   0.00203
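A minimal sketch that reproduces this table by brute force, enumerating all 27 state sequences for up-up-down. Only pi and the entries of A and O shown in the rows above come from the slide; the remaining entries are assumptions taken from the standard stock-market example (Huang, Acero & Hon, 2001). The per-sequence values match the rows above, the best sequence is 1,1,2, and the total is 0.054398 rather than 1 because summing the joint P(w, t) over t gives the marginal P(w), not a distribution over t.

import itertools
import numpy as np

pi = np.array([0.5, 0.2, 0.3])                  # initial state probabilities
A = np.array([[0.6, 0.2, 0.2],                  # transitions A[from, to]
              [0.5, 0.3, 0.2],
              [0.4, 0.1, 0.5]])
O = np.array([[0.7, 0.1, 0.2],                  # emissions O[state, obs]; columns: up, down, unchanged
              [0.1, 0.6, 0.3],                  # (the 'unchanged' column is assumed; it is not
              [0.3, 0.3, 0.4]])                 #  needed for this observation sequence)
obs = [0, 0, 1]                                 # up, up, down

total = 0.0
for seq in itertools.product(range(3), repeat=3):
    p = pi[seq[0]] * O[seq[0], obs[0]]
    for i in range(1, len(obs)):
        p *= A[seq[i - 1], seq[i]] * O[seq[i], obs[i]]
    print([s + 1 for s in seq], round(p, 6))    # e.g., [1, 1, 2] 0.01764
    total += p
print("P(up, up, down) =", round(total, 6))     # 0.054398, summed over all hidden sequences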

Page 10: HMM Challenges

• Given observation sequence(s)

- e.g., up-up-down-up-up

• Decoding Problem

- what states were used to create this sequence?

• Other problems

- what is the probability of this observation sequence, summed over all possible state sequences?

- how can we learn the parameters of the model from sequence data without labelled data, i.e., when the states are hidden?

Page 11: HMMs for tagging

• Recall part-of-speech tagging
 - time/Noun flies/Verb like/Prep an/Art arrow/Noun

• What are the units?
 - words = observations
 - tags = states

• Key challenges
 - estimate the model from state-supervised data
 - e.g., based on frequencies; denoted "Visible Markov Model" in the MRS text
 - A_{Verb,Noun} = how often does Verb follow Noun, versus other tags?
 - prediction for full tag sequences

Page 12: Example

time/Noun flies/Verb like/Prep an/Art arrow/Noun

Prob = P(Noun) P(time | Noun) ⨉
       P(Verb | Noun) P(flies | Verb) ⨉
       P(Prep | Verb) P(like | Prep) ⨉
       P(Art | Prep) P(an | Art) ⨉
       P(Noun | Art) P(arrow | Noun)

… for the other reading, time/Noun flies/Noun like/Verb an/Art arrow/Noun:

Prob = P(Noun) P(time | Noun) ⨉
       P(Noun | Noun) P(flies | Noun) ⨉
       P(Verb | Noun) P(like | Verb) ⨉
       P(Art | Verb) P(an | Art) ⨉
       P(Noun | Art) P(arrow | Noun)

Which do you think is more likely?
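A minimal sketch of this calculation as code. The probability values below are made-up toy numbers purely for illustration (not from the lecture or any corpus); the point is the shape of the product P(t_1) P(w_1 | t_1) ∏ P(t_i | t_{i-1}) P(w_i | t_i).

# toy parameters, invented for illustration: only the entries needed here are given
pi = {"Noun": 0.3}                                                    # P(t_1)
A = {("Noun", "Verb"): 0.3, ("Verb", "Prep"): 0.2, ("Prep", "Art"): 0.5,
     ("Art", "Noun"): 0.7, ("Noun", "Noun"): 0.2, ("Verb", "Art"): 0.2}   # P(t_i | t_{i-1})
O = {("Noun", "time"): 0.01, ("Verb", "flies"): 0.005, ("Prep", "like"): 0.1,
     ("Art", "an"): 0.3, ("Noun", "arrow"): 0.005,
     ("Noun", "flies"): 0.001, ("Verb", "like"): 0.02}                    # P(w_i | t_i)

def joint_prob(words, tags):
    p = pi[tags[0]] * O[(tags[0], words[0])]
    for i in range(1, len(words)):
        p *= A[(tags[i - 1], tags[i])] * O[(tags[i], words[i])]
    return p

words = "time flies like an arrow".split()
print(joint_prob(words, ["Noun", "Verb", "Prep", "Art", "Noun"]))     # the first reading
print(joint_prob(words, ["Noun", "Noun", "Verb", "Art", "Noun"]))     # the other reading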

Page 13: Estimating a visible Markov tagger

• Estimation

- what values to use for P(w | t)?

- what values to use for P(t_i | t_{i-1}) and P(t_1)?

- how about simple frequencies (maximum likelihood estimates), i.e., P(w | t) = count(t, w) / count(t) and P(t_i | t_{i-1}) = count(t_{i-1}, t_i) / count(t_{i-1})

- (probably want to smooth these counts, e.g., by adding 0.5; see the sketch below)
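A minimal counting sketch of this estimation, assuming a training corpus given as a list of sentences, each a list of (word, tag) pairs, with add-0.5 smoothing as suggested above. This is an illustration of the idea, not the lecture's exact recipe.

from collections import Counter

def estimate_vmm(corpus, k=0.5):
    """corpus: list of sentences, each a list of (word, tag) pairs."""
    tag_count, emit_count, trans_count, init_count = Counter(), Counter(), Counter(), Counter()
    for sent in corpus:
        init_count[sent[0][1]] += 1
        for i, (w, t) in enumerate(sent):
            tag_count[t] += 1
            emit_count[(t, w)] += 1
            if i > 0:
                trans_count[(sent[i - 1][1], t)] += 1
    vocab = {w for sent in corpus for w, _ in sent}
    tags = set(tag_count)

    def p_emit(w, t):        # P(w | t) = count(t, w) / count(t), with add-k smoothing
        return (emit_count[(t, w)] + k) / (tag_count[t] + k * len(vocab))

    def p_trans(t_prev, t):  # P(t_i | t_{i-1}); denominator approximated by count(t_prev)
        return (trans_count[(t_prev, t)] + k) / (tag_count[t_prev] + k * len(tags))

    def p_init(t):           # P(t_1)
        return (init_count[t] + k) / (len(corpus) + k * len(tags))

    return p_init, p_trans, p_emit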

Page 14: Prediction

• Prediction

- given a sentence, w, find the sequence of tags, t

- problems

- exponential number of values of t

- but computation can be factorised…

Page 15: Viterbi algorithm

• Form of dynamic programming to solve maximisation

- define a matrix α of size M (length) × T (tags)

- the full-sequence max is then max_{t_M} α[M, t_M]

- how to compute α?

• Note: we’re interested in the arg max, not max

- this can be recovered from α with some additional book-keeping

Page 16: Defining α

• Can be defined recursively
 - α[i, t_i] = max over t_{i-1} of α[i-1, t_{i-1}] × A[t_{i-1}, t_i] × O[t_i, w_i]

• Need a base case to terminate the recursion
 - α[1, t_1] = π[t_1] × O[t_1, w_1]

Page 17: Viterbi illustration

[Figure: Viterbi trellis for the first observations (up, up, …) of the stock example]
 - step 1 (up): α[1,1] = 0.5 × 0.7 = 0.35; α[1,2] = 0.2 × 0.1 = 0.02; α[1,3] = 0.3 × 0.3 = 0.09
 - step 2 (up), state 1: candidates 0.35 × 0.6 × 0.7 = 0.147, 0.02 × 0.5 × 0.7 = 0.007, 0.09 × 0.4 × 0.7 = 0.0252, so α[2,1] = 0.147
 - step 3 onwards: 0.147 × … and so on for each state

All maximising sequences with t_2 = 1 must also have t_1 = 1

No need to consider extending [2,1] or [3,1].

Page 18: Viterbi analysis

• Algorithm as follows

• Time complexity is O(M T^2)

- nb. better to work in log-space, adding log probabilities

import numpy as np

# 0-based indexing (the maths on the earlier slides is 1-based); assumes pi (length T),
# A (T x T transition matrix), O (T x V emission matrix) and w (list of M word indices)
alpha = np.zeros((M, T))
for t in range(T):
    alpha[0, t] = pi[t] * O[t, w[0]]              # base case: initial prob x emission
for i in range(1, M):
    for t_i in range(T):
        for t_last in range(T):
            alpha[i, t_i] = max(alpha[i, t_i],
                                alpha[i - 1, t_last] * A[t_last, t_i] * O[t_i, w[i]])
best = np.max(alpha[M - 1, :])                    # probability of the best tag sequence

Page 19: Backpointers

• Don’t just store max values, α

- also store the argmax 'backpointer', δ[i, t_i]

- can recover the best t_{M-1} from δ[M, t_M]

- and t_{M-2} from δ[M-1, t_{M-1}]

- …

- stopping at t_1 (see the sketch below)
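A hedged sketch of the same Viterbi loop as on the previous slide, extended with a backpointer matrix delta and the traceback described above; it assumes the same pi, A, O, w, M and T variables as that code.

import numpy as np

alpha = np.zeros((M, T))
delta = np.zeros((M, T), dtype=int)              # delta[i, t] = best previous tag for position i, tag t
alpha[0, :] = pi * O[:, w[0]]                    # base case (no backpointer at i = 0)
for i in range(1, M):
    for t_i in range(T):
        scores = alpha[i - 1, :] * A[:, t_i] * O[t_i, w[i]]
        delta[i, t_i] = int(np.argmax(scores))
        alpha[i, t_i] = scores[delta[i, t_i]]

# traceback: start from the best final tag, then follow backpointers back to t_1
tags = [int(np.argmax(alpha[M - 1, :]))]
for i in range(M - 1, 0, -1):
    tags.append(int(delta[i, tags[-1]]))
tags.reverse()                                   # best tag sequence t_1 .. t_M (0-based tag ids)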

Page 20: MEMM taggers

• Change in HMM parameterisation from generative to discriminative

- change from Pr(w, t) to Pr(t | w); and

- change from Pr(t_i | t_{i-1}) Pr(w_i | t_i) to Pr(t_i | w_i, t_{i-1})

• E.g. time/Noun flies/Verb like/Prep …

Prob = P(Noun | time) ⨉ P(Verb | Noun, flies) ⨉ P(Prep | Verb, like) ⨉ …

• Modelled using a maximum entropy classifier (see the sketch below)
 - supports a rich feature set over the whole sentence, w, not just the current word
 - features need not be independent

• A simpler sibling of conditional random fields (CRFs)
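A hedged sketch of MEMM-style tagging. Here local_probs stands in for any trained maximum entropy (logistic regression) classifier returning a distribution over tags given features; the feature names are illustrative assumptions, and greedy left-to-right decoding is used for simplicity (Viterbi-style decoding over Pr(t_i | w_i, t_{i-1}) is equally possible).

from typing import Callable, Dict, List

def memm_tag(words: List[str], tagset: List[str],
             local_probs: Callable[[Dict], Dict[str, float]]) -> List[str]:
    """Greedy left-to-right decoding: pick the best tag at each position given the previous tag."""
    tags: List[str] = []
    prev = "<s>"                                       # start-of-sentence pseudo-tag
    for i, w in enumerate(words):
        feats = {                                      # illustrative features over the sentence
            "word": w.lower(),
            "prev_tag": prev,
            "suffix3": w[-3:],
            "is_capitalised": w[0].isupper(),
            "prev_word": words[i - 1].lower() if i > 0 else "<s>",
        }
        probs = local_probs(feats)                     # Pr(t_i | features), normalised over the tagset
        prev = max(tagset, key=lambda t: probs.get(t, 0.0))
        tags.append(prev)
    return tags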

Page 21: CRF taggers

• Take idea behind MEMM taggers

- change `softmax’ normalisation of probability distributions

- rather than normalising each transition Pr(ti | wi, ti-1)

- normalise over full tag sequence

- Z can be efficiently computed using HMM’s forward-backward algo

Z is a sum over tags ti

Z is a sum over tag sequences t

Page 22: CRF taggers vs MEMMs

• Observation bias problem

- given a few bad tagging decisions, the model doesn't know how to process the next word

- All/DT the/?? indexes dove

- only option is to make a guess

- tag the/DT, even though we know DT-DT isn’t valid

- would prefer to back-track, but have to proceed (probs must sum to 1)

- known as the ‘observation bias’ or ‘label bias problem’

- Klein, D. and Manning, C., "Conditional structure versus conditional estimation in NLP models", EMNLP 2002.

• Contrast with CRF’s global normalisation

- can give all outgoing transitions low scores (no need to sum to 1)

- these paths will result in low probabilities after global normalisation

Page 23: Aside: unsupervised HMM estimation

• Learn the model from words alone, without tags

- Baum-Welch algorithm, maximum likelihood estimation for

P(w) = ∑_t P(w, t)

- variant of the Expectation Maximisation algorithm

- guess model params (e.g., random)

- 1) estimate a tagging of the data (softly, to find expectations)

- 2) re-estimate model params

- repeat steps 1 & 2

- requires forward-backward algorithm for step 1 (see reading)

- formulation similar to the Viterbi algorithm, with max → sum (see the sketch below)

• Can be used for tag induction, with some success
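A minimal sketch of the forward computation needed for step 1: structurally the same as the Viterbi code on the "Viterbi analysis" slide, with max replaced by sum, and assuming the same pi, A, O, w, M and T variables as there.

import numpy as np

alpha = np.zeros((M, T))
alpha[0, :] = pi * O[:, w[0]]                                    # base case
for i in range(1, M):
    for t_i in range(T):
        alpha[i, t_i] = np.sum(alpha[i - 1, :] * A[:, t_i]) * O[t_i, w[i]]
prob_w = float(alpha[M - 1, :].sum())                            # P(w) = sum_t P(w, t)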

Page 24: HMMs in NLP

• HMMs are highly effective for part-of-speech tagging

- trigram HMM gets 96.7% accuracy (Brants, TNT, 2000)

- related models are state of the art

- feature based techniques based on logistic regression & HMMs

- Maximum entropy Markov model (MEMM) gets 97.1% accuracy (Ratnaparkhi, 1996)

- Conditional random fields (CRFs) can get up to ~97.3%

- scores reported on English Penn Treebank tagging accuracy

• Other sequence labelling tasks

- named entity recognition

- shallow parsing …

Page 25: Information extraction

• Task is to find references to people, places, companies etc in text

- Tony Abbott [PERSON] has declared the GP co-payment as "dead, buried and cremated" after it was finally dumped on Tuesday [DATE].

• Applications in

- text retrieval, text understanding, relation extraction

• Can we frame this as a word tagging task?

- not immediately obvious, as some entities are multi-word

- one solution is to change the model

- hidden semi-Markov models, semi-CRFs can handle multi-word observations

- easiest to map to a word-based tagset

Page 26: IOB sequence labelling

• BIO labelling trick applied to each word

- B = begin entity

- I = inside (continuing) entity

- O = outside, non-entity

• E.g.,

- Tony/B-PERSON Abbott/I-PERSON has/O declared/O the/O GP/O co-payment/O as/O "/O dead/O ,/O buried/O and/O cremated/O "/O after/O it/O was/O finally/O dumped/O on/O Tuesday/B-DATE ./O

• Analysis

- allows for adjacent entities (e.g., B-PERSON B-PERSON)

- expands the tag set by a small factor, which raises efficiency and learning issues

- often B-??? is used only for adjacent entities, not after an O label (a span-recovery sketch follows below)
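A minimal sketch of recovering entity spans from BIO tags, assuming tags of the form O, B-TYPE and I-TYPE as in the example above; an I- tag after O is tolerated, matching the last point.

def bio_to_spans(tags):
    """Return (start, end, type) spans, with end exclusive."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):          # sentinel flushes a final entity
        boundary = tag == "O" or tag.startswith("B-") or (etype is not None and tag[2:] != etype)
        if boundary and start is not None:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]                     # open a new entity (I- after O tolerated)
    return spans

# e.g., bio_to_spans(["B-PERSON", "I-PERSON", "O", "B-DATE"]) -> [(0, 2, 'PERSON'), (3, 4, 'DATE')]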

Page 27: Shallow parsing

• Related task of shallow or ‘chunk’ parsing

- fragment the sentence into parts

- noun, verb, prepositional, adjective etc phrases

- simple non-hierarchical setting, just aiming to find core parts of each phrase

- supports simple analysis, e.g., document search for NPs or finding relations from co-occurring NPs and VPs

• E.g.,

- [He NP] [reckons VP] [the current account deficit NP] [will narrow VP] [to PP] [only # 1.8 billion NP] [in PP] [September NP] .

- He/B-NP reckons/B-VP the/B-NP current/I-NP account/I-NP deficit/I-NP will/B-VP narrow/I-VP to/B-PP only/B-NP #/I-NP 1.8/I-NP billion/I-NP in/B-PP September/B-NP ./O

Page 28: CoNLL competitions

• Shared tasks at CoNLL

- 2000: shallow parsing evaluation challenge

- 2003: named entity evaluation challenge

- more recently have considered parsing, semantic role labelling, relation extraction, grammar correction etc.

• Sequence tagging models predominate

- shown to be highly effective

- key challenge is incorporating task knowledge in terms of clever features

- e.g., gazetteer features, capitalisation, prefix/suffix etc. (see the sketch below)

- limited ability to do so in an HMM

- feature-based models like MEMMs and CRFs are more flexible
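An illustrative sketch of the kind of features mentioned in the last point. The gazetteer here is a hypothetical hand-made set, not a resource from the lecture; in a MEMM or CRF each of these values would become one or more indicator features.

GAZETTEER = {"melbourne", "australia", "tuesday"}        # hypothetical gazetteer entries

def token_features(words, i):
    w = words[i]
    return {
        "word.lower": w.lower(),
        "is_capitalised": w[0].isupper(),
        "prefix2": w[:2].lower(),
        "suffix3": w[-3:].lower(),
        "has_digit": any(c.isdigit() for c in w),
        "in_gazetteer": w.lower() in GAZETTEER,
        "prev_word": words[i - 1].lower() if i > 0 else "<s>",
        "next_word": words[i + 1].lower() if i + 1 < len(words) else "</s>",
    }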

Page 29: Summary

• Probabilistic models of sequences

- introduced HMMs, a widely used model in NLP and many other fields

- supervised estimation for learning

- Viterbi algorithm for efficient prediction

- related ideas (not covered in detail)

- unsupervised learning with HMMs

- CRFs and other feature based sequence models

- applications to many NLP tasks

- named entity recognition

- shallow parsing

Page 30: Readings

Choose one of the following on HMMs: Manning & Schütze, chapters 9 (9.1-9.2, 9.3.2) & 10 (10.1-10.2)

Rabiner’s HMM tutorial http://tinyurl.com/2hqaf8

Shallow parsing and named entity tagging: the CoNLL competitions in 2000 and 2003, overview papers

http://www.cnts.ua.ac.be/conll2000
http://www.cnts.ua.ac.be/conll2003/

[Optional] Contemporary sequence tagging methods: Lafferty et al., Conditional random fields: Probabilistic models for segmenting and labeling sequence data (2001), ICML

Next lecture: lexical semantics and word sense disambiguation