Word Classes and Part-of-Speech Tagging (Chapter 5)

  • Word classes and part of speech tagging

    Chapter 5

    Outline
    - Tag sets and problem definition
    - Automatic approaches 1: rule-based tagging
    - Automatic approaches 2: stochastic tagging
    - In Part 2: finish stochastic tagging, then continue on to evaluation

    An Example

    WORD     LEMMA   TAG
    the      the     +DET
    girl     girl    +NOUN
    kissed   kiss    +VPAST
    the      the     +DET
    boy      boy     +NOUN
    on       on      +PREP
    the      the     +DET
    cheek    cheek   +NOUN

    Word Classes: Tag Sets
    - Tag sets vary in the number of tags: from about a dozen to over 200
    - The size of a tag set depends on the language, objectives, and purpose

    Word Classes: Tag set example
    (tag set table omitted; visible entries: PRP, PRP$)

    Example of Penn Treebank Tagging of a Brown Corpus Sentence
    The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

    VB DT NN . Book that flight .

    VBZ DT NN VB NN ? Does that flight serve dinner ?

    See http://www.infogistics.com/posdemo.htm

    Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo

    The Problem
    Words often have more than one word class, e.g. "this":
    - This is a nice day = PRP
    - This day is nice = DT
    - You can go this far = RB

    Word Class Ambiguity (in the Brown Corpus)
    - Unambiguous (1 tag): 35,340
    - Ambiguous (2-7 tags): 4,100
    (DeRose, 1988)

    2 tags    3,760
    3 tags      264
    4 tags       61
    5 tags       12
    6 tags        2
    7 tags        1

    Part-of-Speech Tagging
    - Rule-Based Tagger: ENGTWOL (ENGlish TWO Level analysis)
    - Stochastic Tagger: HMM-based
    - Transformation-Based Tagger (Brill) (we won't cover this)

    Rule-Based Tagging
    Basic idea:
    - Assign all possible tags to each word
    - Remove tags according to a set of rules of the type: "if word+1 is an adjective, adverb, or quantifier and the following is a sentence boundary and word-1 is not a verb like 'consider', then eliminate non-ADV tags, else eliminate ADV tags"
    - Typically more than 1000 hand-written rules

    Sample ENGTWOL Lexicon
    Demo: http://www2.lingsoft.fi/cgi-bin/engtwol

    Stage 1 of ENGTWOL Tagging
    First stage: run words through a morphological analyzer to get all parts of speech.
    Example: "Pavlov had shown that salivation"

    Pavlov      PAVLOV N NOM SG PROPER
    had         HAVE V PAST VFIN SVO
                HAVE PCP2 SVO
    shown       SHOW PCP2 SVOO SVO SV
    that        ADV
                PRON DEM SG
                DET CENTRAL DEM SG
                CS
    salivation  N NOM SG

    Stage 2 of ENGTWOL Tagging
    Second stage: apply constraints.
    Constraints are used in a negative way, to eliminate candidate tags.
    Example: adverbial "that" rule
      Given input: "that"
      If (+1 A/ADV/QUANT) (+2 SENT-LIM) (NOT -1 SVOC/A)
      Then eliminate non-ADV tags
      Else eliminate ADV tag
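
    A minimal Python sketch of applying such a constraint, assuming a toy representation where each token carries a set of candidate tags. The tag names follow the slide, but the helper names, the verb list, and the example sentence are illustrative, not ENGTWOL's actual machinery.

# ENGTWOL-style negative constraint: eliminate candidate tags rather than pick one.
SENT_LIM = {".", "?", "!"}               # sentence-boundary tokens (assumption)
ADJ_ADV_QUANT = {"A", "ADV", "QUANT"}    # adjective / adverb / quantifier tags
SVOC_A_VERBS = {"consider"}              # verbs like "consider" (illustrative list)

def adverbial_that_rule(tokens, tags, i):
    """Apply the adverbial-'that' constraint at position i, editing tags in place.

    tokens: list of word strings; tags: parallel list of sets of candidate tags.
    """
    if tokens[i].lower() != "that":
        return
    next_is_a = i + 1 < len(tokens) and tags[i + 1] & ADJ_ADV_QUANT
    then_boundary = i + 2 < len(tokens) and tokens[i + 2] in SENT_LIM
    prev_is_svoc = i > 0 and tokens[i - 1].lower() in SVOC_A_VERBS
    if next_is_a and then_boundary and not prev_is_svoc:
        tags[i] &= {"ADV"}       # eliminate non-ADV readings
    else:
        tags[i] -= {"ADV"}       # eliminate the ADV reading

# "it isn't that odd ." -- here "that" should keep only its ADV reading.
tokens = ["it", "isn't", "that", "odd", "."]
tags = [{"PRON"}, {"V"}, {"ADV", "DET", "PRON", "CS"}, {"A"}, {"SENT"}]
adverbial_that_rule(tokens, tags, 2)
print(tags[2])   # {'ADV'}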

    Stochastic Tagging
    - Based on the probability of a certain tag occurring, given the various possibilities
    - Requires a training corpus
    - No probabilities for words not in the corpus

    Stochastic Tagging (cont.)
    Simple method: choose the most frequent tag in the training text for each word.
    - Result: 90% accuracy
    - This is the baseline; other methods will do better
    - HMM tagging is one example
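
    A sketch of this most-frequent-tag baseline, assuming the training text is given as (word, tag) pairs; the tiny corpus, the function names, and the unknown-word fallback are illustrative.

from collections import Counter, defaultdict

def train_baseline(tagged_words):
    """Return (word -> most frequent tag, overall most frequent tag)."""
    per_word, overall = defaultdict(Counter), Counter()
    for word, tag in tagged_words:
        per_word[word][tag] += 1
        overall[tag] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    return lexicon, overall.most_common(1)[0][0]

def tag_baseline(words, lexicon, default_tag):
    # Unknown words fall back to the corpus-wide most frequent tag.
    return [lexicon.get(w, default_tag) for w in words]

corpus = [("the", "DT"), ("flight", "NN"), ("book", "VB"),
          ("book", "NN"), ("book", "VB"), ("that", "DT")]
lexicon, default_tag = train_baseline(corpus)
print(tag_baseline(["book", "that", "flight"], lexicon, default_tag))  # ['VB', 'DT', 'NN']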

    HMM Tagger
    Intuition: pick the most likely tag for each word.
    - Let T = t1, t2, ..., tn
    - Let W = w1, w2, ..., wn
    - Find the POS tags that generate the sequence of words, i.e., look for the most probable sequence of tags T underlying the observed words W

    Toward a Bigram-HMM Tagger
    argmax_T P(T|W)
    = argmax_T P(T) P(W|T)
    = argmax_T P(t1...tn) P(w1...wn | t1...tn)
    ≈ argmax_T [P(t1) P(t2|t1) ... P(tn|tn-1)] [P(w1|t1) P(w2|t2) ... P(wn|tn)]

    To tag a single word: ti = argmax_j P(tj|ti-1) P(wi|tj)
    - How do we compute P(ti|ti-1)?  c(ti-1, ti) / c(ti-1)
    - How do we compute P(wi|ti)?  c(wi, ti) / c(ti)
    - How do we compute the most probable tag sequence?  Viterbi
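
    The two count-based estimates above can be read straight off a tagged corpus. A sketch, assuming training data given as lists of (word, tag) sentences; the toy sentences and names are illustrative.

from collections import Counter

def estimate_hmm(tagged_sentences):
    """MLE estimates: P(t_i|t_{i-1}) = c(t_{i-1},t_i)/c(t_{i-1}), P(w_i|t_i) = c(w_i,t_i)/c(t_i)."""
    tag_counts, bigram_counts, emit_counts = Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        prev = "<s>"                      # sentence-start pseudo-tag
        tag_counts[prev] += 1
        for word, tag in sent:
            bigram_counts[(prev, tag)] += 1
            emit_counts[(word, tag)] += 1
            tag_counts[tag] += 1
            prev = tag
    trans = {bt: c / tag_counts[bt[0]] for bt, c in bigram_counts.items()}
    emit = {wt: c / tag_counts[wt[1]] for wt, c in emit_counts.items()}
    return trans, emit

sents = [[("book", "VB"), ("that", "DT"), ("flight", "NN")],
         [("that", "DT"), ("flight", "NN"), ("left", "VBD")]]
trans, emit = estimate_hmm(sents)
print(trans[("DT", "NN")])      # c(DT,NN)/c(DT) = 2/2 = 1.0
print(emit[("flight", "NN")])   # c(flight,NN)/c(NN) = 2/2 = 1.0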

    Disambiguating "race"
    (example figure omitted)

    Example
    P(NN|TO) = .00047
    P(VB|TO) = .83
    P(race|NN) = .00057
    P(race|VB) = .00012
    P(NR|VB) = .0027
    P(NR|NN) = .0012

    P(VB|TO) P(NR|VB) P(race|VB) = .00000027
    P(NN|TO) P(NR|NN) P(race|NN) = .00000000032
    So we (correctly) choose the verb reading.
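
    A quick check of the two products, using the probabilities from the slide:

p_verb = 0.83 * 0.0027 * 0.00012      # P(VB|TO) * P(NR|VB) * P(race|VB)
p_noun = 0.00047 * 0.0012 * 0.00057   # P(NN|TO) * P(NR|NN) * P(race|NN)
print(f"{p_verb:.8f} {p_noun:.11f}")  # 0.00000027 0.00000000032 (rounded)
print(p_verb > p_noun)                # True: the verb reading wins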

    Hidden Markov Models
    What we've described with these two kinds of probabilities is a Hidden Markov Model (HMM).

    Definitions
    - A weighted finite-state automaton adds probabilities to the arcs; the probabilities on arcs leaving a node must sum to one
    - A Markov chain is a special case of a WFSA in which the input sequence uniquely determines which states the automaton will go through
    - Markov chains can't represent inherently ambiguous problems; they are useful for assigning probabilities to unambiguous sequences

    Markov Chain for Weather (figure omitted)

    Markov Chain for Words (figure omitted)

    Markov Chain: First-order observable Markov Model
    - A set of states Q = q1, q2, ..., qN; the state at time t is qt
    - Transition probabilities: a set of probabilities A = a01, a02, ..., an1, ..., ann. Each aij represents the probability of transitioning from state i to state j. The set of these is the transition probability matrix A
    - The current state depends only on the previous state
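
    A small sketch of such a first-order chain: a set of states plus a transition matrix whose rows sum to one. The states and numbers are illustrative, not the ones from the omitted weather figure.

import random

states = ["HOT", "COLD", "RAINY"]
A = {"HOT":   {"HOT": 0.6, "COLD": 0.3, "RAINY": 0.1},   # each row sums to 1.0
     "COLD":  {"HOT": 0.3, "COLD": 0.5, "RAINY": 0.2},
     "RAINY": {"HOT": 0.2, "COLD": 0.4, "RAINY": 0.4}}

def sample_chain(start, steps):
    """Generate a state sequence; the next state depends only on the current one."""
    seq = [start]
    for _ in range(steps):
        row = A[seq[-1]]
        seq.append(random.choices(list(row), weights=list(row.values()))[0])
    return seq

print(sample_chain("HOT", 5))   # a random 6-state walk, e.g. ['HOT', 'HOT', 'COLD', ...]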

    HMM for Ice Cream
    - You are a climatologist in the year 2799, studying global warming
    - You can't find any records of the weather in Baltimore, MD for the summer of 2007
    - But you find Jason Eisner's diary, which lists how many ice creams Jason ate every day that summer
    - Our job: figure out how hot it was

    Hidden Markov Model
    - For Markov chains, the symbols are the same as the states: we see hot weather, so we're in state hot
    - But in part-of-speech tagging, the output symbols are words while the hidden states are part-of-speech tags
    - A Hidden Markov Model is an extension of a Markov chain in which the output symbols are not the same as the states
    - This means we don't know which state we are in

    Hidden Markov Models
    - States Q = q1, q2, ..., qN
    - Observations O = o1, o2, ..., oN; each observation is a symbol from a vocabulary V = {v1, v2, ..., vV}
    - Transition probabilities: transition probability matrix A = {aij}
    - Observation likelihoods: output probability matrix B = {bi(k)}
    - Special initial probability vector π
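
    These components written out as plain Python for the ice cream/weather example; the probability values are illustrative stand-ins, not the numbers from the omitted slide figures.

Q = ("HOT", "COLD")                       # hidden states
V = (1, 2, 3)                             # observation vocabulary: ice creams eaten per day
A = {"HOT": {"HOT": 0.7, "COLD": 0.3},    # transition probabilities a_ij
     "COLD": {"HOT": 0.4, "COLD": 0.6}}
B = {"HOT": {1: 0.2, 2: 0.4, 3: 0.4},     # observation likelihoods b_i(k)
     "COLD": {1: 0.5, 2: 0.4, 3: 0.1}}
pi = {"HOT": 0.8, "COLD": 0.2}            # initial probability vector

# Probability that one particular hidden path generates one observation sequence:
# P(o = 3,1 and q = HOT,COLD) = pi[HOT] * B[HOT][3] * A[HOT][COLD] * B[COLD][1]
p = pi["HOT"] * B["HOT"][3] * A["HOT"]["COLD"] * B["COLD"][1]
print(p)   # 0.048 with these numbers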

    Eisner Task
    Given: ice cream observation sequence: 1, 2, 3, 2, 2, 2, 3
    Produce: weather sequence: H, C, H, H, H, C

    HMM for Ice Cream (figure omitted)

    Transition Probabilities (figure omitted)

    Observation Likelihoods (figure omitted)

    Decoding
    OK, now we have a complete model that can give us what we need.
    - We could just enumerate all paths given the input and use the model to assign probabilities to each (see the sketch below). Not a good idea.
    - Luckily, dynamic programming (also seen in Ch. 3 with minimum edit distance, which we didn't cover) helps us here.
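
    A sketch of the brute-force enumeration dismissed above, to make the cost concrete: it scores every one of the T^W possible tag sequences with the bigram model. The transition and emission tables are assumed to look like the earlier estimation sketch; names are illustrative.

from itertools import product

def brute_force_tag(words, tagset, trans, emit, start="<s>"):
    """Score all len(tagset) ** len(words) tag sequences and keep the best one."""
    best_seq, best_p = None, 0.0
    for seq in product(tagset, repeat=len(words)):   # exponential in sentence length
        p, prev = 1.0, start
        for word, tag in zip(words, seq):
            p *= trans.get((prev, tag), 0.0) * emit.get((word, tag), 0.0)
            prev = tag
        if p > best_p:
            best_seq, best_p = seq, p
    return best_seq, best_p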

    Viterbi Algorithm
    The Viterbi algorithm is used to compute the most likely tag sequence in O(W × T²) time, where T is the number of possible part-of-speech tags and W is the number of words in the sentence.

    The algorithm sweeps through all the tag possibilities for each word, computing the best sequence leading to each possibility. The key that makes this algorithm efficient is that we only need to know the best sequences leading to the previous word's tags.
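
    A compact Viterbi sketch for the ice cream HMM, keeping for each state only the best path into it at each time step; the probabilities below are illustrative stand-ins for the omitted figures, and the state/variable names are assumptions.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for obs in O(len(obs) * len(states)**2) time."""
    # V[t][s] = (probability of the best path ending in state s at time t, back-pointer)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for o in obs[1:]:
        row = {}
        for s in states:
            # Best previous state leading into s.
            prev = max(states, key=lambda q: V[-1][q][0] * trans_p[q][s])
            row[s] = (V[-1][prev][0] * trans_p[prev][s] * emit_p[s][o], prev)
        V.append(row)
    # Backtrace from the best final state.
    last = max(states, key=lambda s: V[-1][s][0])
    path = [last]
    for row in reversed(V[1:]):
        path.append(row[path[-1]][1])
    return list(reversed(path))

states = ("HOT", "COLD")
start_p = {"HOT": 0.8, "COLD": 0.2}
trans_p = {"HOT": {"HOT": 0.7, "COLD": 0.3}, "COLD": {"HOT": 0.4, "COLD": 0.6}}
emit_p  = {"HOT": {1: 0.2, 2: 0.4, 3: 0.4}, "COLD": {1: 0.5, 2: 0.4, 3: 0.1}}
print(viterbi([3, 1, 3], states, start_p, trans_p, emit_p))  # ['HOT', 'HOT', 'HOT'] with these numbers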
