Part-of-Speech Tagging & Parsing
Post on 23-Feb-2016
Chloé Kiddon
(slides adapted and/or stolen outright from Andrew McCallum, Christopher Manning, and Julia Hockenmaier)

Announcements
We do have a Hadoop cluster! It's offsite. I need to know all groups who want it!
You all have accounts for MySQL on the cubist machine (cubist.cs.washington.edu). Your folder is /projects/instr/cse454/a-f.
I'll have a better email out this afternoon, I hope.
Grading: HW1 should be finished by next week.

Timely warning
POS tagging and parsing are two large topics in NLP, usually covered in 2-4 lectures. We have an hour and twenty minutes.

Part-of-speech tagging
We often want to know what part of speech (POS) or word class (noun, verb, ...) should be assigned to words in a piece of text. Part-of-speech tagging assigns POS labels to words:

    JJ        JJ    NNS   VBP   RB
    Colorless green ideas sleep furiously.

(Second line first. Why?)

Why do we care?
Parsing (we'll come to this later)
Speech synthesis: INsult or inSULT, overFLOW or OVERflow, REad or reAD
Information extraction: entities, relationsRomeo loves Juliet vs. lost loves found again
Machine translation

Penn Treebank Tagset
1.  CC    Coordinating conjunction
2.  CD    Cardinal number
3.  DT    Determiner
4.  EX    Existential there
5.  FW    Foreign word
6.  IN    Preposition or subordinating conjunction
7.  JJ    Adjective
8.  JJR   Adjective, comparative
9.  JJS   Adjective, superlative
10. LS    List item marker
11. MD    Modal
12. NN    Noun, singular or mass
13. NNS   Noun, plural
14. NP    Proper noun, singular
15. NPS   Proper noun, plural
16. PDT   Predeterminer
17. POS   Possessive ending
18. PP    Personal pronoun
19. PP$   Possessive pronoun
20. RB    Adverb
21. RBR   Adverb, comparative
22. RBS   Adverb, superlative
23. RP    Particle
24. SYM   Symbol
25. TO    to
26. UH    Interjection
27. VB    Verb, base form
28. VBD   Verb, past tense
29. VBG   Verb, gerund or present participle
30. VBN   Verb, past participle
31. VBP   Verb, non-3rd person singular present
32. VBZ   Verb, 3rd person singular present
33. WDT   Wh-determiner
34. WP    Wh-pronoun
35. WP$   Possessive wh-pronoun
36. WRB   Wh-adverb

Ambiguity
Buffalo buffalo buffalo. What makes this task hard?
THE buffalo FROM Buffalo WHO ARE buffaloed BY buffalo FROM Buffalo ALSO buffalo THE buffalo FROM Buffalo.
How many words are ambiguous?
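One way to make this concrete: given any tagged corpus, count the word types that appear with more than one tag. A minimal sketch with toy data (the corpus below is hypothetical; a real measurement would use the Penn Treebank):

```python
from collections import defaultdict

# Toy tagged corpus (hypothetical data, not real treebank counts).
corpus = [("back", "RB"), ("back", "VB"), ("back", "JJ"), ("back", "NN"),
          ("the", "DT"), ("dog", "NN"), ("runs", "VBZ"), ("runs", "NNS")]

tags_seen = defaultdict(set)
for word, tag in corpus:
    tags_seen[word].add(tag)

# A word type is ambiguous if it was seen with two or more tags.
ambiguous = sorted(w for w, ts in tags_seen.items() if len(ts) > 1)
print(ambiguous)                        # ['back', 'runs']
print(len(ambiguous) / len(tags_seen))  # 0.5
```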
(Hockenmaier)
These are just the tags words appeared with in the corpus; there may be unseen word/tag combinations. The main problem in part-of-speech tagging is how to choose between the tags for ambiguous words.

Naïve approach!
Pick the most common tag for the word
91% success rate!
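The naïve baseline above (pick each word's most common training tag) can be sketched as follows; the toy corpus and the NN default for unseen words are illustrative assumptions:

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Count (word, tag) pairs and keep the most frequent tag per word."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_baseline(model, words, default="NN"):
    """Tag each word with its most common training tag (NN for unseen words)."""
    return [model.get(w, default) for w in words]

# Hypothetical three-sentence corpus: "runs" is ambiguous (VBZ vs. NNS).
corpus = [[("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
          [("the", "DT"), ("runs", "NNS"), ("end", "VBP")],
          [("the", "DT"), ("dog", "NN"), ("runs", "VBZ")]]
model = train_baseline(corpus)
print(tag_baseline(model, ["the", "dog", "runs"]))  # ['DT', 'NN', 'VBZ']
```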
(Andrew McCallum)

We have more information
We are not just tagging words, we are tagging sequences of words.
For a sequence of words W: W = w1 w2 w3 ... wn
we are looking for a sequence of tags T: T = t1 t2 t3 ... tn
such that P(T|W) is maximized.
(Andrew McCallum)

In an ideal world
Find all instances of a sequence in the dataset and pick the most common sequence of tags.
Count(heat oil in a large pot) = 0 ???? Uhh...

Sparse data problem
Most sequences will never occur, or will occur too few times for good predictions.
Bayes Rule
To find P(T|W), use Bayes Rule:

    P(T|W) = P(W|T) * P(T) / P(W)

We can maximize P(T|W) by maximizing P(W|T) * P(T): P(W) won't change, since the words don't change.

Finding P(T)
Generally, by the chain rule,

    P(T) = P(t1) * P(t2|t1) * P(t3|t1 t2) * ... * P(tn|t1 ... tn-1)

but it is usually not feasible to accurately estimate more than tag bigrams (possibly trigrams).
In some way (supervised or unsupervised), we are going to need to learn these probabilities.

Markov assumption
Assume that the probability of a tag depends only on the tag that came directly before it:

    P(ti | t1 ... ti-1) ≈ P(ti | ti-1)
Then we only need to count tag bigrams.

Putting it all together
We can similarly assume

    P(W|T) ≈ P(w1|t1) * P(w2|t2) * ... * P(wn|tn)

and the final equation becomes:

    P(T|W) ∝ P(w1|t1) * P(t1|t0) * P(w2|t2) * P(t2|t1) * ... * P(wn|tn) * P(tn|tn-1)
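In code, the product P(w1|t1) P(t1|t0) * ... * P(wn|tn) P(tn|tn-1) is usually scored in log space to avoid underflow. A minimal sketch; the probability tables below are made-up toy numbers, not estimates from a real corpus:

```python
import math

# Toy probability tables (illustrative numbers only). "<s>" is the start state t0.
trans = {("<s>", "DT"): 0.5, ("DT", "NN"): 0.7, ("NN", "VBZ"): 0.4}
emit = {("DT", "the"): 0.4, ("NN", "dog"): 0.001, ("VBZ", "barks"): 0.0005}

def log_score(words, tags):
    """log [ P(W|T) * P(T) ] under the bigram HMM factorization above."""
    total = 0.0
    prev = "<s>"
    for w, t in zip(words, tags):
        total += math.log(trans[(prev, t)]) + math.log(emit[(t, w)])
        prev = t
    return total

print(log_score(["the", "dog", "barks"], ["DT", "NN", "VBZ"]))
```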
Process as an HMM
Start in an initial state t0 with probability π(t0). Move from state ti to state tj with transition probability a(tj|ti). In state ti, emit symbol wk with emission probability b(wk|ti).

[Figure: a small HMM over Det, Adj, Noun, and Verb states with transition probabilities, and emission tables P(w|Det) (the, a), P(w|Adj) (low, good), P(w|Noun) (deal, price)]

Three Questions for HMMs
Evaluation: given a sequence of words W = w1 w2 w3 ... wn and an HMM model λ, what is P(W|λ)?
Decoding (tagging): given a sequence of words W and an HMM model λ, find the most probable tag sequence T = t1 t2 t3 ... tn.
Learning: given a tagged (or untagged) dataset, find the HMM λ that maximizes the probability of the data.

Tagging
We need to find the most likely tag sequence given a sequence of words: the one that maximizes P(W|T) * P(T) and thus P(T|W).
Trellis
[Figure: a trellis with tags t1 ... tj ... tN on one axis and time steps (words) on the other]

Evaluation task: P(w1, w2, ..., wi), summed over all paths, given that we are in tag tj at time i.
Decoding task: max log P(w1, w2, ..., wi) over paths ending in tag tj at time i.
At every one of the n time steps (= words), the HMM can be in one of N states (= tags).

Tagging: initialization
For each tag tj, the first trellis column is

    delta_1(tj) = log P(w1|tj) + log P(tj)

Tagging: recursive step
Each subsequent cell takes the best-scoring predecessor (the standard Viterbi recursion, consistent with the initialization above):

    delta_i(tj) = log P(wi|tj) + max_k [ delta_{i-1}(tk) + log P(tj|tk) ]

Pick the best trellis cell for the last word.
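The whole trellis procedure — initialize the first column, take the best predecessor at each step, pick the best final cell, then follow back pointers — can be sketched as follows. The model (tags, initial, transition, and emission probabilities) is a made-up toy, not trained from data:

```python
import math

# Toy HMM parameters (illustrative numbers only).
tags = ["DT", "NN", "VBZ"]
init = {"DT": 0.8, "NN": 0.15, "VBZ": 0.05}
trans = {"DT": {"DT": 0.05, "NN": 0.9, "VBZ": 0.05},
         "NN": {"DT": 0.05, "NN": 0.25, "VBZ": 0.7},
         "VBZ": {"DT": 0.4, "NN": 0.3, "VBZ": 0.3}}
emit = {"DT": {"the": 0.9, "dog": 0.05, "barks": 0.05},
        "NN": {"the": 0.05, "dog": 0.8, "barks": 0.15},
        "VBZ": {"the": 0.05, "dog": 0.05, "barks": 0.9}}

def viterbi(words):
    """Most probable tag sequence via the log-space trellis recursion."""
    # Initialization: delta_1(t) = log P(w1|t) + log P(t)
    delta = [{t: math.log(init[t]) + math.log(emit[t][words[0]]) for t in tags}]
    back = []
    # Recursion: delta_i(t) = log P(wi|t) + max_k [delta_{i-1}(k) + log P(t|k)]
    for w in words[1:]:
        scores, pointers = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda k: delta[-1][k] + math.log(trans[k][t]))
            scores[t] = (delta[-1][best_prev] + math.log(trans[best_prev][t])
                         + math.log(emit[t][w]))
            pointers[t] = best_prev
        delta.append(scores)
        back.append(pointers)
    # Pick the best cell for the last word, then follow back pointers.
    best = max(tags, key=lambda t: delta[-1][t])
    path = [best]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))

print(viterbi(["the", "dog", "barks"]))  # ['DT', 'NN', 'VBZ']
```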
Then use back pointers to pick the best sequence.

Learning a POS-tagging HMM
Estimate the parameters in the model using counts.
With smoothing, this model can get 95-96% correct tagging, with models of 40-80 tags.

Problem with supervised learning
Requires a large hand-labeled corpus, which is expensive to produce and doesn't scale to new languages or new domains.
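The count-based estimation with smoothing mentioned above might look like the sketch below. The add-k smoothing scheme and the tiny corpus are illustrative assumptions; real estimates would come from a treebank:

```python
from collections import Counter

# Hypothetical tiny tagged corpus.
corpus = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
          [("a", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]]

tag_counts, bigram_counts, emit_counts = Counter(), Counter(), Counter()
vocab = set()
for sent in corpus:
    prev = "<s>"
    for word, tag in sent:
        tag_counts[tag] += 1
        bigram_counts[(prev, tag)] += 1
        emit_counts[(tag, word)] += 1
        vocab.add(word)
        prev = tag
tagset = set(tag_counts)

def p_trans(t, prev, k=1.0):
    """Add-k smoothed P(t|prev) = (C(prev,t)+k) / (C(prev)+k*|tags|)."""
    prev_total = sum(c for (p, _), c in bigram_counts.items() if p == prev)
    return (bigram_counts[(prev, t)] + k) / (prev_total + k * len(tagset))

def p_emit(w, t, k=1.0):
    """Add-k smoothed P(w|t) = (C(t,w)+k) / (C(t)+k*(|V|+1))."""
    return (emit_counts[(t, w)] + k) / (tag_counts[t] + k * (len(vocab) + 1))

print(p_trans("NN", "DT"))  # high: DT is always followed by NN in this corpus
print(p_emit("dog", "NN"))
```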
(A hand-labeled corpus is expensive to produce in both time and money, unless you get undergrads to do the tagging.)

Instead, apply unsupervised learning with Expectation Maximization (EM):
Expectation step: calculate the probability of all sequences using the current parameters (compute the forward and backward probabilities for the given model parameters and our observations).
Maximization step: re-estimate the parameters using the results from the E-step.
With EM, you need to know your tagset beforehand.

Lots of other techniques!
Trigram models (more common)
Text normalization
Error-based transformation learning (Brill learning): a rule-based system. Calculate initial states (proper noun detection, tagged corpus), then acquire transformation rules, e.g. change VB to NN when the previous word was an adjective: "The long race finally ended"
Minimally supervised learning: unlabeled data, but have a dictionary
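A single Brill-style transformation rule like the one above ("change VB to NN when the previous word was an adjective") can be sketched as follows; the rule and sentence are taken from the slide, and the function name is just illustrative:

```python
# One Brill-style transformation rule: change VB to NN
# when the previous word's tag is an adjective (JJ).
def apply_rule(tagged):
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == "VB" and out[i - 1][1] == "JJ":
            out[i] = (word, "NN")
    return out

# "race" was initially (mis)tagged VB; the rule fires after the adjective "long".
sentence = [("The", "DT"), ("long", "JJ"), ("race", "VB"),
            ("finally", "RB"), ("ended", "VBD")]
print(apply_rule(sentence))  # ('race', 'VB') becomes ('race', 'NN')
```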
Seems like POS-tagging is solved
Penn Treebank POS-tagging accuracy is approaching the human ceiling: human (interannotator) agreement is 97%.
In other languages, not so much: languages with more complex morphology need much larger tagsets for tagging to be useful, and will contain many more distinct word forms in corpora of the same size.

So now we are HMM Masters
We can use HMMs to:
Tag words in a sentence with their parts of speech
Extract entities and other information from a sentence
Can we use them to determine syntactic categories?
Syntax
Refers to the study of the way words are arranged together, and the relationships between them: a precise way to define and describe the structure of sentences.
Prescriptive vs. descriptive: descriptive grammar covers how people actually speak ("That was an epic fail"; the Star Trek intro's split infinitive "to boldly go where no man has gone before"), while prescriptive grammar says how they should.
The goal of syntax is to model the knowledge that people unconsciously have about the grammar of their native language.
Parsing extracts the syntax from a sentence.

Parsing applications
High-precision Question-Answering systems
Named Entity Recognition (NER) and information extraction
Opinion extraction in product reviews
Improved interaction in computer applications/games

Basic English sentence structure
    Ike              eats           cake
    Noun (subject)   Verb (head)    Noun (object)
(Hockenmaier)

Can we build an HMM?
    Noun (subject) -> Verb (head) -> Noun (object)
    Ike, dogs, ...    eat, sleep, ...  cake, science, ...
(Hockenmaier)

Words take arguments
I eat cake.
*I sleep cake.
I give you cake.
I give cake. Hmm...
I eat you cake???
Subcategorization:
  Intransitive verbs take only a subject.
  Transitive verbs take a subject and an object.
  Ditransitive verbs take a subject, an object, and an indirect object.
Selectional preferences: the object of "eat" should be edible.
(Hockenmaier)

A better model
    Noun (subject) -> Transitive Verb (head) -> Noun (object)
    Ike, dogs, ...    eat, like, ...             cake, science, ...
    Noun (subject) -> Intransitive Verb (head)
    Ike, dogs, ...    sleep, run, ...
(Hockenmaier)

Language has recursive properties
Colonel Mustard killed Mrs. Peacock
Colonel Mustard killed Mrs. Peacock in the library
Colonel Mustard killed Mrs. Peacock in the library with the candlestick
Colonel Mustard killed Mrs. Peacock in the library with the candlestick at midnight
HMMs can't generate hierarchical structure
Colonel Mustard killed Mrs. Peacock in the library with the candlestick at midnight.
Does Mustard have the candlestick? Or is the candlestick just sitting in the library? The attachment of each prepositional phrase (to a noun or to the verb) is ambiguous.
HMMs are memoryless: they can't make long-range decisions about attachments.
Need a bet
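The recursive structure above is naturally captured by a context-free grammar, which a memoryless HMM cannot represent. A toy sketch (the grammar, vocabulary, and depth limit are all illustrative assumptions): the rule NP -> NP PP lets prepositional phrases nest without bound.

```python
import random

# Hypothetical toy CFG. The recursive rule NP -> NP PP generates
# unboundedly nested modifiers, which a finite-state HMM cannot.
grammar = {
    "S":  [["NP", "VP"]],
    "VP": [["V", "NP"], ["VP", "PP"]],
    "NP": [["Colonel Mustard"], ["Mrs. Peacock"], ["the library"],
           ["the candlestick"], ["NP", "PP"]],
    "PP": [["P", "NP"]],
    "V":  [["killed"]],
    "P":  [["in"], ["with"], ["at"]],
}

def generate(symbol, depth=0, max_depth=5):
    """Expand a nonterminal recursively; strings not in the grammar are terminals."""
    if symbol not in grammar:
        return symbol
    options = grammar[symbol]
    if depth >= max_depth:  # force non-recursive expansions so generation halts
        nonrec = [o for o in options if symbol not in o]
        options = nonrec or options
    production = random.choice(options)
    return " ".join(generate(s, depth + 1, max_depth) for s in production)

random.seed(0)
print(generate("S"))
```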