Ch 9 Part of Speech Tagging
(slides adapted from Dan Jurafsky, Jim Martin, Dekang Lin, Rada Mihalcea, and Bonnie Dorr and Mitch Marcus.)
Parts of Speech
8 (ish) traditional parts of speech
• Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc
• This idea has been around for over 2000 years (Dionysius Thrax of Alexandria, c. 100 B.C.)
• Called: parts-of-speech, lexical category, word classes, morphological classes, lexical tags, POS
• We’ll use POS most frequently
POS examples for English
N      noun         chair, bandwidth, pacing
V      verb         study, debate, munch
ADJ    adj          purple, tall, ridiculous
ADV    adverb       unfortunately, slowly
P      preposition  of, by, to
PRO    pronoun      I, me, mine
DET    determiner   the, a, that, those
Open Class Words
Every known human language has nouns and verbs
Nouns: people, places, things
• Classes of nouns
  - proper vs. common
  - count vs. mass
Verbs: actions and processes
Adjectives: properties, qualities
Adverbs: hodgepodge!
• Unfortunately, John walked home extremely slowly yesterday
Definition:
An adverb is a part of speech. It is any word that modifies any other part of language: verbs, adjectives (including numbers), clauses, sentences and other adverbs, except for nouns; modifiers of nouns are primarily determiners and adjectives.
Closed Class Words
Differ more from language to language than open class words
Examples:
• prepositions: on, under, over, …
• particles: up, down, on, off, …
• determiners: a, an, the, …
• pronouns: she, who, I, …
• conjunctions: and, but, or, …
• auxiliary verbs: can, may, should, …
• numerals: one, two, three, third, …
Prepositions from CELEX
Pronouns in CELEX
Conjunctions
Auxiliaries
NLP Task I – Determining Part of Speech Tags
The Problem:
Word     POS listing in Brown Corpus
heat     noun, verb
oil      noun
in       prep, noun, adv
a        det, noun, noun-proper
large    adj, noun, adv
pot      noun
POS Tagging: Definition
The process of assigning a part-of-speech or lexical class marker to each word in a corpus:
WORDS: the koala put the keys on the table
TAGS: N, V, P, DET
POS Tagging example
WORD     tag
the      DET
koala    N
put      V
the      DET
keys     N
on       P
the      DET
table    N
What is POS tagging good for?
Speech synthesis:
• How to pronounce “lead”?
• INsult inSULT
• OBject obJECT
• OVERflow overFLOW
• DIScount disCOUNT
• CONtent conTENT
Stemming for information retrieval
• Knowing a word is a N tells you it gets plurals
• Can search for “aardvarks” and get “aardvark”
Parsing, speech recognition, etc.
• Possessive pronouns (my, your, her) are likely to be followed by nouns
• Personal pronouns (I, you, he) are likely to be followed by verbs
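The speech-synthesis point can be sketched as a lookup keyed on both the word and its tag. This is only an illustration: the `STRESS` table and the `pronounce` function are hypothetical names, the tags `N` and `V` follow the POS table earlier in the deck, and the table holds just three of the noun/verb pairs listed above (noun stressed on the first syllable, verb on the second).

```python
# Hypothetical stress table: (word, POS tag) -> rendering with the stressed syllable in capitals.
STRESS = {
    ("insult", "N"): "INsult",
    ("insult", "V"): "inSULT",
    ("object", "N"): "OBject",
    ("object", "V"): "obJECT",
    ("discount", "N"): "DIScount",
    ("discount", "V"): "disCOUNT",
}


def pronounce(word, tag):
    """Return the stress pattern for a tagged word, or the word unchanged if it is not in the table."""
    return STRESS.get((word.lower(), tag), word)


print(pronounce("object", "N"))
print(pronounce("object", "V"))
```

Without the tag, the lookup has no way to choose between the two entries for the same spelling, which is why a synthesizer benefits from running a tagger first.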
Related Problem in Bioinformatics
Durbin et al. Biological Sequence Analysis, Cambridge University Press.
Several applications, e.g. proteins
From primary structure ATCPLELLLD
Infer secondary structure HHHBBBBBC..
History: From Yair Halevi (Bar-Ilan U.)
[Figure: timeline, 1960 to 2000. Milestones shown:]
• Brown Corpus created (EN-US), 1 million words
• Brown Corpus tagged
• Greene and Rubin: rule based, 70%
• LOB Corpus created (EN-UK), 1 million words
• LOB Corpus tagged
• HMM tagging (CLAWS): 93%-95%
• DeRose/Church: efficient HMM, sparse data, 95%+
• British National Corpus (tagged by CLAWS)
• Penn Treebank Corpus (WSJ, 4.5M)
• POS tagging separated from other NLP
• Transformation-based tagging (Eric Brill): rule based, 95%+
• Tree-based statistics (Helmut Schmid): rule based, 96%+
• Neural network: 96%+
• Trigram tagger (Kempe): 96%+
• Combined methods: 98%+
British National Corpus
What is it used for?
Ultimately, its use is limited only by our imagination; if you have any need for up to 100 million words of modern British English, you can make use of the British National Corpus.
The main uses of the corpus are as follows:
Reference Book Publishing
• Dictionaries, grammar books, teaching materials, usage guides, thesauri. Increasingly, publishers are referring to the use they make of corpus facilities: it's important to know how well their corpora are planned and constructed.
Linguistic Research
• Raw data for studying lexis, syntax, morphology, semantics, discourse analysis, stylistics, sociolinguistics...
Artificial Intelligence
• Extensive data test bed for program development.
Natural Language Processing
• Taggers, parsers, natural language understanding programs, spell checking word lists...
English Language Teaching
• Syllabus and materials design, classroom reference, independent learner research.
Penn Treebank Tagset
A Simplified Tagset for English
Tagsets for English have grown progressively larger since the Brown Corpus until the Penn Treebank project.
Brown Corpus:            87 tags
LOB Corpus:              135 tags
Lancaster UCREL group:   166 tags
London-Lund Corpus:      197 tags
UPenn Treebank:          34 tags + punctuation
Rationale behind British & European tag sets
To provide “distinct codings for all classes of words having distinct grammatical behaviour” – Garside et al. 1987
The Lund tagset for adverbs distinguishes between
• Adjunct – Process, Space, Time
• Wh-type – Manner, Reason, Space, Time, Wh-type + ‘S
• Conjunct – Appositional, Contrastive, Inferential, Listing, …
• Disjunct – Content, Style
• Postmodifier – “else”
• Negative – “not”
• Discourse Item – Appositional, Expletive, Greeting, Hesitator, …
Reasons for a Smaller Tagset
Many tags are unique to particular lexical items, and can be recovered automatically if desired.
Brown Tags For Verbs
be/BE        have/HV       sing/VB
is/BEZ       has/HVZ       sings/VBZ
was/BED      had/HVD       sang/VBD
being/BEG    having/HVG    singing/VBG
been/BEN     had/HVN       sung/VBN
Penn Treebank Tags For Verbs
be/VB        have/VB       sing/VB
is/VBZ       has/VBZ       sings/VBZ
was/VBD      had/VBD       sang/VBD
being/VBG    having/VBG    singing/VBG
been/VBN     had/VBN       sung/VBN
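The claim that the finer Brown tags "can be recovered automatically" can be sketched for the verb forms in the two tables above. This is a minimal illustration covering only those forms: `penn_to_brown`, `BE_FORMS`, `HAVE_FORMS` and `SUFFIX` are names made up for the example, and the full Brown tagset contains further distinctions that are not handled here.

```python
# Forms of "be" and "have" that appear in the tables; Brown gives these verbs their own tag prefixes.
BE_FORMS = {"be", "is", "was", "being", "been"}
HAVE_FORMS = {"have", "has", "had", "having"}

# Penn verb tag -> suffix carried by the corresponding Brown BE*/HV* tag.
SUFFIX = {"VB": "", "VBZ": "Z", "VBD": "D", "VBG": "G", "VBN": "N"}


def penn_to_brown(word, penn_tag):
    """Recover the Brown verb tag from a word and its Penn Treebank tag."""
    if penn_tag not in SUFFIX:
        return penn_tag  # not one of the verb tags covered by this sketch
    w = word.lower()
    if w in BE_FORMS:
        return "BE" + SUFFIX[penn_tag]
    if w in HAVE_FORMS:
        return "HV" + SUFFIX[penn_tag]
    return penn_tag  # other verbs carry the same tag in both tables


print(penn_to_brown("been", "VBN"))
print(penn_to_brown("had", "VBD"))
print(penn_to_brown("sang", "VBD"))
```

Because the word itself determines whether the Brown prefix is `BE`, `HV` or `VB`, the smaller tagset loses no information for these forms.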
Task I – Determining Part of Speech Tags
The Problem:
Word     POS listing in Brown
heat     noun, verb
oil      noun
in       prep, noun, adv
a        det, noun, noun-proper
large    adj, noun, adv
pot      noun
The Old Solution: Combinatorial search.
• If each of n words has k tags on average, try the k^n combinations until one works.
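To see how quickly the combinatorial search grows, the candidate taggings for the example phrase can simply be enumerated. The sketch below copies the candidate tags from the Brown listing above; `CANDIDATES` and `all_taggings` are illustrative names, not part of any tagger.

```python
from itertools import product

# Candidate tags per word, taken from the Brown Corpus listing above.
CANDIDATES = {
    "heat": ["noun", "verb"],
    "oil": ["noun"],
    "in": ["prep", "noun", "adv"],
    "a": ["det", "noun", "noun-proper"],
    "large": ["adj", "noun", "adv"],
    "pot": ["noun"],
}


def all_taggings(words):
    """Return an iterator over every possible tag sequence for the given words."""
    return product(*(CANDIDATES[w] for w in words))


sentence = "heat oil in a large pot".split()
taggings = list(all_taggings(sentence))
print(len(taggings))  # 2 * 1 * 3 * 3 * 3 * 1 = 54
```

Even this six-word phrase has 54 candidate taggings, and the count multiplies with every additional ambiguous word, which is why exhaustive search does not scale to real sentences.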
NLP Task I – Determining Part of Speech Tags
Machine Learning Solutions: Automatically learn Part of Speech (POS) assignment.
• The best techniques achieve 96-97% accuracy per word on new materials, given large training corpora.
Simple Statistical Approaches: Idea 1
Simple Statistical Approaches: Idea 2
For a string of words
W = w1 w2 w3 … wn
find the string of POS tags
T = t1 t2 t3 … tn
which maximizes P(T|W)
• i.e., the probability of tag string T given that the word string was W
• i.e., that W was tagged T
Again, The Sparse Data Problem …
A Simple, Impossible Approach to Compute P(T|W):
Count up instances of the string "heat oil in a large pot" in the training corpus, and pick the most common tag assignment to the string.
A Practical Statistical Tagger
A Practical Statistical Tagger II
But we can't accurately estimate more than tag bigrams or so…
We change to a model that we CAN estimate:
A Practical Statistical Tagger III
So, for a given string W = w1w2w3…wn, the tagger needs to find the string of tags T which maximizes
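One way to write the quantity being maximized, assuming the standard bigram hidden Markov model formulation: apply Bayes' rule, drop P(W) because it is the same for every T, then assume each word depends only on its own tag and each tag only on the previous tag. The start symbol t_0 is an assumption introduced here so that the first tag also has a predecessor.

```latex
\hat{T} = \arg\max_{T} P(T \mid W)
        = \arg\max_{T} \frac{P(W \mid T)\, P(T)}{P(W)}
        = \arg\max_{T} P(W \mid T)\, P(T)
        \approx \arg\max_{T} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
```

The last line uses only tag bigrams P(t_i | t_{i-1}) and per-tag word probabilities P(w_i | t_i), both of which can be estimated from a tagged corpus of realistic size.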
Training and Performance
To estimate the parameters of this model, given an annotated training corpus:
Because many of these counts are small, smoothing is necessary for best results…
Such taggers typically achieve about 95-96% correct tagging, for tag sets of 40-80 tags.
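A minimal sketch of such a tagger, assuming relative-frequency estimates from a tagged corpus, add-one smoothing (one simple choice among many smoothing methods), and Viterbi search for the best tag sequence. The names `train` and `viterbi`, the `<s>` start pseudo-tag, and the two-sentence toy corpus are all made up for illustration; a real tagger would be trained on a large corpus such as those listed earlier.

```python
import math
from collections import Counter


def train(tagged_sentences):
    """Count tag bigrams and word/tag pairs in sentences given as lists of (word, tag)."""
    trans, emit, context, tag_count = Counter(), Counter(), Counter(), Counter()
    vocab = set()
    for sent in tagged_sentences:
        prev = "<s>"  # assumed start-of-sentence pseudo-tag
        for word, tag in sent:
            trans[(prev, tag)] += 1   # count(prev tag, tag)
            context[prev] += 1        # how often prev was followed by any tag
            emit[(tag, word)] += 1    # count(word tagged as tag)
            tag_count[tag] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, context, tag_count, sorted(tag_count), vocab


def viterbi(words, model):
    """Return the tag sequence maximizing the product of P(w_i|t_i) * P(t_i|t_{i-1})."""
    trans, emit, context, tag_count, tags, vocab = model
    n_tags, n_words = len(tags), len(vocab) + 1  # +1 reserves mass for unseen words

    def log_trans(prev, tag):  # add-one smoothed log P(tag | prev)
        return math.log((trans[(prev, tag)] + 1) / (context[prev] + n_tags))

    def log_emit(tag, word):  # add-one smoothed log P(word | tag)
        return math.log((emit[(tag, word)] + 1) / (tag_count[tag] + n_words))

    # best[t] = (log-probability, tag path) of the best sequence ending in tag t
    best = {t: (log_trans("<s>", t) + log_emit(t, words[0]), [t]) for t in tags}
    for word in words[1:]:
        new_best = {}
        for t in tags:
            score, path = max((best[p][0] + log_trans(p, t), best[p][1]) for p in tags)
            new_best[t] = (score + log_emit(t, word), path + [t])
        best = new_best
    return max(best.values())[1]


# Toy training corpus (invented for this example).
corpus = [
    [("the", "DET"), ("koala", "N"), ("put", "V"), ("the", "DET"),
     ("keys", "N"), ("on", "P"), ("the", "DET"), ("table", "N")],
    [("the", "DET"), ("keys", "N"), ("put", "V"), ("the", "DET"),
     ("koala", "N"), ("on", "P"), ("the", "DET"), ("table", "N")],
]
model = train(corpus)
print(viterbi("the koala put the keys on the table".split(), model))
```

Working in log probabilities avoids numerical underflow on long sentences, and the dynamic program visits each position once per pair of tags instead of enumerating every tag sequence.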