POS Tagging: Introduction

Heng Ji
hengji@cs.qc.cuny.edu
Feb 2, 2008

Acknowledgement: some slides from Ralph Grishman, Nicolas Nicolov, J&M

Some Administrative Stuff

  • Assignment 1 due on Feb 17
  • Textbook: required for assignments and final exam

Outline

  • Parts of speech (POS)
  • Tagsets
  • POS Tagging
  • Rule-based tagging
  • Markup Format
  • Open source Toolkits

What is Part-of-Speech (POS)?

  • Generally speaking, Word Classes (= POS)
      • Verb, Noun, Adjective, Adverb, Article, …
  • We can also include inflection
      • Verbs: tense, number, …
      • Nouns: number, proper/common, …
      • Adjectives: comparative, superlative, …
      • …

Parts of Speech

  • 8 (ish) traditional parts of speech
      • Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
  • Called: parts-of-speech, lexical categories, word classes, morphological classes, lexical tags
  • Lots of debate within linguistics about the number, nature, and universality of these
      • We'll completely ignore this debate

7 Traditional POS Categories

  N     noun          chair, bandwidth, pacing
  V     verb          study, debate, munch
  ADJ   adj           purple, tall, ridiculous
  ADV   adverb        unfortunately, slowly
  P     preposition   of, by, to
  PRO   pronoun       I, me, mine
  DET   determiner    the, a, that, those

POS Tagging

  • The process of assigning a part-of-speech or lexical class marker to each word in a collection

  WORD    tag
  the     DET
  koala   N
  put     V
  the     DET
  keys    N
  on      P
  the     DET
  table   N
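As a minimal sketch, a tagged sentence can be held as a list of (word, tag) pairs. The Python below reuses the koala example; the variable names are illustrative assumptions, not part of the lecture.

```python
# Words and coarse tags from the koala example.
words = ["the", "koala", "put", "the", "keys", "on", "the", "table"]
tags = ["DET", "N", "V", "DET", "N", "P", "DET", "N"]

# A tagged sentence is one (word, tag) pair per token.
tagged = list(zip(words, tags))

# Print the pairs in a simple word/tag layout.
print(" ".join(f"{word}/{tag}" for word, tag in tagged))
```

Keeping words and tags paired per token makes it easy to compare a tagger's output against a hand-annotated reference position by position.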

Penn TreeBank POS Tag Set

  • Penn Treebank: hand-annotated corpus of Wall Street Journal, 1M words
  • 46 tags
  • Some particularities:
      • to/TO not disambiguated
      • Auxiliaries and verbs not distinguished


Penn Treebank Tagset

Why POS tagging is useful

  • Speech synthesis
      • How to pronounce "lead"?
      • INsult / inSULT, OBject / obJECT, OVERflow / overFLOW, DIScount / disCOUNT, CONtent / conTENT
  • Stemming for information retrieval
      • Can search for "aardvarks", get "aardvark"
  • Parsing, speech recognition, etc.
      • Possessive pronouns (my, your, her) followed by nouns
      • Personal pronouns (I, you, he) likely to be followed by verbs
      • Need to know if a word is an N or V before you can parse
  • Information extraction
      • Finding names, relations, etc.
  • Machine Translation

Equivalent Problem in Bioinformatics

  • Durbin et al., Biological Sequence Analysis, Cambridge University Press
  • Several applications, e.g. proteins
  • From primary structure: ATCPLELLLD
  • Infer secondary structure: HHHBBBBBC

Why is POS Tagging Useful?

  • First step of a vast number of practical tasks
  • Speech synthesis
      • How to pronounce "lead"?
      • INsult / inSULT, OBject / obJECT, OVERflow / overFLOW, DIScount / disCOUNT, CONtent / conTENT
  • Parsing
      • Need to know if a word is an N or V before you can parse
  • Information extraction
      • Finding names, relations, etc.
  • Machine Translation

Open and Closed Classes

  • Closed class: a small fixed membership
      • Prepositions: of, in, by, …
      • Auxiliaries: may, can, will, had, been, …
      • Pronouns: I, you, she, mine, his, them, …
      • Usually function words (short common words which play a role in grammar)
  • Open class: new ones can be created all the time
      • English has 4: Nouns, Verbs, Adjectives, Adverbs
      • Many languages have these 4, but not all

Open Class Words

  • Nouns
      • Proper nouns (Boulder, Granby, Eli Manning)
          • English capitalizes these
      • Common nouns (the rest)
      • Count nouns and mass nouns
          • Count: have plurals, get counted: goat/goats, one goat, two goats
          • Mass: don't get counted (snow, salt, communism) (*two snows)
  • Adverbs: tend to modify things
      • Unfortunately, John walked home extremely slowly yesterday
      • Directional/locative adverbs (here, home, downhill)
      • Degree adverbs (extremely, very, somewhat)
      • Manner adverbs (slowly, slinkily, delicately)
  • Verbs
      • In English, have morphological affixes (eat/eats/eaten)

Closed Class Words

  Examples:
  • prepositions: on, under, over, …
  • particles: up, down, on, off, …
  • determiners: a, an, the, …
  • pronouns: she, who, I
  • conjunctions: and, but, or, …
  • auxiliary verbs: can, may, should, …
  • numerals: one, two, three, third, …

Prepositions from CELEX

English Particles

Conjunctions

POS Tagging: Choosing a Tagset

  • There are so many parts of speech, so many potential distinctions we can draw
  • To do POS tagging, we need to choose a standard set of tags to work with
  • Could pick very coarse tagsets
      • N, V, Adj, Adv
  • More commonly used set is finer grained: the "Penn TreeBank tagset", 45 tags
      • PRP$, WRB, WP$, VBG
  • Even more fine-grained tagsets exist
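To make the coarse versus fine-grained trade-off concrete, here is a small sketch that collapses Penn Treebank tags into the coarse N, V, Adj, Adv classes by tag prefix. The function name `coarse_tag` and the catch-all class "Other" are assumptions made for illustration; the lecture does not define such a mapping.

```python
def coarse_tag(penn_tag):
    """Collapse a Penn Treebank tag to N, V, Adj, Adv, or Other by prefix.

    This is an illustrative mapping only. Tags such as WRB (wh-adverb)
    fall into "Other" here because they do not start with "RB".
    """
    if penn_tag.startswith("NN"):
        return "N"
    if penn_tag.startswith("VB"):
        return "V"
    if penn_tag.startswith("JJ"):
        return "Adj"
    if penn_tag.startswith("RB"):
        return "Adv"
    return "Other"


for tag in ["NNS", "VBG", "JJ", "RB", "PRP$", "WRB"]:
    print(tag, "->", coarse_tag(tag))
```

A coarse tagset is easier to predict, but it discards distinctions (for example VBD versus VBN) that later processing steps may need.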

Using the Penn Tagset

  • The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS
  • Prepositions and subordinating conjunctions marked IN ("although/IN I/PRP")
  • Except the preposition/complementizer "to", which is just marked "TO"

POS Tagging

  • Words often have more than one POS: back
      • The back door = JJ
      • On my back = NN
      • Win the voters back = RB
      • Promised to back the bill = VB
  • The POS tagging problem is to determine the POS tag for a particular instance of a word

  These examples from Dekang Lin

How Hard is POS Tagging? Measuring Ambiguity
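One simple way to quantify ambiguity, sketched here under assumptions, is to count how many distinct word types appear with more than one tag in a hand-tagged corpus. The function name `ambiguity_rate` and the toy data are hypothetical; the figures from the original slide are not reproduced.

```python
from collections import defaultdict


def ambiguity_rate(tagged_tokens):
    """Return the fraction of word types seen with more than one tag."""
    tags_by_word = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_by_word[word.lower()].add(tag)
    if not tags_by_word:
        return 0.0  # no tokens, so nothing is ambiguous
    ambiguous = sum(1 for tags in tags_by_word.values() if len(tags) > 1)
    return ambiguous / len(tags_by_word)


# Toy corpus: "back" occurs as both JJ and NN, the other words have one tag.
toy = [("the", "DT"), ("back", "JJ"), ("door", "NN"),
       ("my", "PRP$"), ("back", "NN"), ("the", "DT")]
print(ambiguity_rate(toy))
```

Counting by type and counting by token give different pictures: a small share of word types can be ambiguous while covering a large share of running text, because ambiguous words tend to be frequent.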

Current Performance

  • How many tags are correct?
      • About 97% currently
      • But baseline is already 90%
  • Baseline algorithm:
      • Tag every word with its most frequent tag
      • Tag unknown words as nouns
  • How well do people do?
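The baseline algorithm is short enough to sketch directly. The code below is a minimal illustration, assuming training data given as lists of (word, tag) pairs and using NN as the noun tag for unknown words; the function names and toy sentences are assumptions, not the course's reference implementation.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Map each lowercased word to the tag it carries most often in training."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag_baseline(words, model, unknown_tag="NN"):
    """Tag known words with their most frequent tag, unknown words as nouns."""
    return [(w, model.get(w.lower(), unknown_tag)) for w in words]


def accuracy(predicted, gold):
    """Fraction of positions where the predicted tag equals the gold tag."""
    correct = sum(p == g for (_, p), (_, g) in zip(predicted, gold))
    return correct / len(gold)


train = [
    [("the", "DT"), ("back", "NN"), ("door", "NN")],
    [("my", "PRP$"), ("back", "NN")],
    [("to", "TO"), ("back", "VB"), ("the", "DT"), ("bill", "NN")],
]
model = train_baseline(train)
predicted = tag_baseline(["the", "back", "koala"], model)
gold = [("the", "DT"), ("back", "VB"), ("koala", "NN")]
print(predicted, accuracy(predicted, gold))
```

On this toy data "back" is always tagged NN because that is its most frequent training tag, which is exactly the kind of error a context-sensitive tagger is meant to fix.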

Quick Test: Agreement?

  • the students went to class
  • plays well with others
  • fruit flies like a banana

  DT: the, this, that
  NN: noun
  VB: verb
  P: preposition
  ADV: adverb

Quick Test

  the   students   went   to   class
  DT    NN         VB     P    NN

  plays   well   with   others
  VB      ADV    P      NN
  NN      NN     P      DT

  fruit   flies   like   a    banana
  NN      NN      VB     DT   NN
  NN      VB      P      DT   NN
  NN      NN      P      DT   NN
  NN      VB      VB     DT   NN

How to do it? History

  Timeline, 1960 to 2000:
  • Brown Corpus created (EN-US), 1 million words
  • Brown Corpus tagged
  • HMM tagging (CLAWS): 93%-95%
  • Greene and Rubin, rule based: 70%
  • LOB Corpus created (EN-UK), 1 million words
  • DeRose/Church, efficient HMM, sparse data: 95%+
  • British National Corpus (tagged by CLAWS)
  • POS tagging separated from other NLP
  • Transformation-based tagging (Eric Brill), rule based: 95%+
  • Tree-based statistics (Helmut Schmid), rule based: 96%+
  • Neural network: 96%+
  • Trigram tagger (Kempe): 96%+
  • Combined methods: 98%+
  • Penn Treebank Corpus (WSJ, 4.5M)
  • LOB Corpus tagged

Two Methods for POS Tagging

  1. Rule-based tagging (ENGTWOL)
  2. Stochastic
      1. Probabilistic sequence models
          • HMM (Hidden Markov Model) tagging
          • MEMMs (Maximum Entropy Markov Models)

Rule-Based Tagging

  • Start with a dictionary
  • Assign all possible tags to words from the dictionary
  • Write rules by hand to selectively remove tags
  • Leaving the correct tag for each word

Rule-based taggers

  • Early POS taggers all hand-coded
  • Most of these (Harris, 1962; Greene and Rubin, 1971) and the best of the recent ones, ENGTWOL (Voutilainen, 1995), are based on a two-stage architecture
      • Stage 1: look up word in lexicon to give list of potential POSs
      • Stage 2: apply rules which certify or disallow tag sequences
  • Rules originally handwritten; more recently Machine Learning methods can be used

Start With a Dictionary

  • she: PRP
  • promised: VBN, VBD
  • to: TO
  • back: VB, JJ, RB, NN
  • the: DT
  • bill: NN, VB
  • Etc… for the ~100,000 words of English with more than 1 tag

Assign Every Possible Tag

                           NN
                           RB
            VBN            JJ            VB
  PRP       VBD       TO   VB     DT     NN
  She       promised  to   back   the    bill

Write Rules to Eliminate Tags

  Eliminate VBN if VBD is an option when VBN|VBD follows "<start> PRP"

                           NN
                           RB
                           JJ            VB
  PRP       VBD       TO   VB     DT     NN
  She       promised  to   back   the    bill

  Eliminated: VBN (for "promised")
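The dictionary, assign-all-tags, and eliminate steps can be sketched in a few lines. The lexicon below copies the toy dictionary from the slides; the function names and the way the rule is encoded are assumptions for illustration and do not reproduce any particular tagger.

```python
# Toy dictionary from the slides: each word maps to its possible tags.
LEXICON = {
    "she": ["PRP"],
    "promised": ["VBN", "VBD"],
    "to": ["TO"],
    "back": ["VB", "JJ", "RB", "NN"],
    "the": ["DT"],
    "bill": ["NN", "VB"],
}


def assign_all_tags(words):
    """Look up every word and return its full candidate tag list.

    Words missing from the toy lexicon raise KeyError in this sketch.
    """
    return [list(LEXICON[w.lower()]) for w in words]


def eliminate_vbn_after_start_prp(candidates):
    """Drop VBN when VBD is also possible and the word follows '<start> PRP'."""
    result = [list(c) for c in candidates]  # copy so the input is not modified
    if len(result) >= 2 and result[0] == ["PRP"]:
        if "VBN" in result[1] and "VBD" in result[1]:
            result[1].remove("VBN")
    return result


sentence = "She promised to back the bill".split()
candidates = assign_all_tags(sentence)
pruned = eliminate_vbn_after_start_prp(candidates)
for word, tags in zip(sentence, pruned):
    print(word, tags)
```

After this single rule "promised" is resolved to VBD, while "back" and "bill" remain ambiguous; a real rule-based tagger needs many more constraints to leave exactly one tag per word.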

Stage 1 of ENGTWOL Tagging

  • First Stage: run words through an FST morphological analyzer to get all parts of speech
  • Example: Pavlov had shown that salivation …

  Pavlov       PAVLOV N NOM SG PROPER
  had          HAVE V PAST VFIN SVO
               HAVE PCP2 SVO
  shown        SHOW PCP2 SVOO SVO SV
  that         ADV
               PRON DEM SG
               DET CENTRAL DEM SG
               CS
  salivation   N NOM SG

Stage 2 of ENGTWOL Tagging

  • Second Stage: apply NEGATIVE constraints
  • Example: adverbial "that" rule
      • Eliminates all readings of "that" except the one in "It isn't that odd"

  Given input: "that"
  If
      (+1 A/ADV/QUANT)    if next word is adj/adv/quantifier
      (+2 SENT-LIM)       following which is E-O-S
      (NOT -1 SVOC/A)     and the previous word is not a verb like "consider", which allows adjective complements as in "I consider that odd"
  Then eliminate non-ADV tags
  Else eliminate ADV
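The adverbial "that" constraint can be written as an ordinary function over candidate tag sets. This is a simplified sketch, not ENGTWOL's actual rule formalism: tokens are (word, set of tags) pairs, the labels A, ADV, and QUANT stand in for the adjective, adverb, and quantifier readings, and `SVOC_A_VERBS` is an assumed one-word stand-in for the class of verbs like "consider".

```python
SENT_LIM = {".", "!", "?"}       # assumed sentence-limit tokens
SVOC_A_VERBS = {"consider"}      # assumed stand-in for the SVOC/A verb class


def apply_adverbial_that(tokens, i):
    """Apply the adverbial-'that' constraint to tokens[i]; return its new tags.

    tokens is a list of (word, set_of_tags) pairs and tokens[i] is "that".
    """
    _, tags = tokens[i]
    # (+1 A/ADV/QUANT): the next word can be an adjective, adverb, or quantifier.
    next_ok = i + 1 < len(tokens) and bool(tokens[i + 1][1] & {"A", "ADV", "QUANT"})
    # (+2 SENT-LIM): the position after that is the end of the sentence.
    after_is_limit = i + 2 >= len(tokens) or tokens[i + 2][0] in SENT_LIM
    # (NOT -1 SVOC/A): the previous word is not a verb like "consider".
    prev_ok = i == 0 or tokens[i - 1][0].lower() not in SVOC_A_VERBS
    if next_ok and after_is_limit and prev_ok:
        return tags & {"ADV"}    # eliminate non-ADV tags
    return tags - {"ADV"}        # otherwise eliminate ADV


THAT = {"ADV", "PRON", "DET", "CS"}
s1 = [("It", {"PRON"}), ("isn't", {"V"}), ("that", set(THAT)), ("odd", {"A"})]
s2 = [("I", {"PRON"}), ("consider", {"V"}), ("that", set(THAT)), ("odd", {"A"})]
print(apply_adverbial_that(s1, 2))
print(apply_adverbial_that(s2, 2))
```

The first sentence keeps only the ADV reading, while the second keeps every reading except ADV, matching the two branches of the rule.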

Inline Mark-up

  • POS Tagging
      • http://nlp.cs.qc.cuny.edu/wsj_pos.zip
  • Input Format
      Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
  • Output Format
      Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.
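Reading and writing the inline format takes only a few lines. The sketch below assumes tokens are separated by whitespace and that each token has the form word/TAG with a slash before the tag; the function names are hypothetical and the exact layout of the course data file is not verified here.

```python
def parse_tagged(line):
    """Split 'word/TAG word/TAG ...' into a list of (word, tag) pairs.

    The split is on the last slash so that words containing a slash
    keep it. Tokens with no slash are not expected in this sketch.
    """
    pairs = []
    for token in line.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def strip_tags(line):
    """Recover the untagged input sentence from a tagged line."""
    return " ".join(word for word, _ in parse_tagged(line))


def format_tagged(pairs):
    """Write (word, tag) pairs back out in inline word/TAG form."""
    return " ".join(f"{word}/{tag}" for word, tag in pairs)


line = "Pierre/NNP Vinken/NNP 61/CD years/NNS old/JJ"
pairs = parse_tagged(line)
print(pairs)
print(strip_tags(line))
```

Because parsing and formatting are inverses for well-formed lines, the same helpers can turn a gold-standard file into tagger input and turn tagger output back into the file format for scoring.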

POS Tagging Tools

  • NYU Prof. Ralph Grishman's HMM POS tagger (in Java)
      • http://nlp.cs.qc.cuny.edu/jet.zip
      • http://nlp.cs.qc.cuny.edu/jet_src.zip
      • http://www.cs.nyu.edu/cs/faculty/grishman/jet/license.html
  • Demo
  • How it works:
      • Learned HMM: data/pos_hmm.txt
      • Source code: src/jet/HMM/HMMTagger.java

POS Tagging Tools

  • Stanford tagger (Loglinear tagger)
      • http://nlp.stanford.edu/software/tagger.shtml
  • Brill tagger
      • http://www.tech.plym.ac.uk/soc/staff/guidbugm/software/RULE_BASED_TAGGER_V.1.14.tar.Z
      • tagger LEXICON test BIGRAMS LEXICALRULEFILE CONTEXTUALRULEFILE
  • YamCha (SVM)
      • http://chasen.org/~taku/software/yamcha/
  • MXPOST (Maximum Entropy)
      • ftp://ftp.cis.upenn.edu/pub/adwait/jmx/
  • More complete list at: http://www-nlp.stanford.edu/links/statnlp.html#Taggers

NLP Toolkits

  • Uniform CL Annotation Platform
      • UIMA (IBM NLP platform): http://incubator.apache.org/uima/svn.html
      • Mallet (UMASS): http://mallet.cs.umass.edu/index.php/Main_Page
      • MinorThird (CMU): http://minorthird.sourceforge.net/
      • NLTK: http://nltk.sourceforge.net/
          • Natural language toolkit, with data sets
          • Demo
  • Information Extraction
      • Jet (NYU IE toolkit): http://www.cs.nyu.edu/cs/faculty/grishman/jet/license.html
      • Gate: http://gate.ac.uk/download/index.html
          • University of Sheffield IE toolkit
  • Information Retrieval
      • INDRI: http://www.lemurproject.org/indri/
          • Information Retrieval toolkit
  • Machine Translation
      • Compara: http://adamastor.linguateca.pt/COMPARA/Welcome.html
      • ISI decoder: http://www.isi.edu/licensed-sw/rewrite-decoder/
      • MOSES: http://www.statmt.org/moses/

Looking Ahead: Next Class

  • Machine Learning for POS Tagging
      • Hidden Markov Model

Page 3: POS Tagging: Introduction

339

Outline

Parts of speech (POS) Tagsets POS Tagging

Rule-based tagging Markup Format Open source Toolkits

439

What is Part-of-Speech (POS) Generally speaking Word Classes (=POS)

Verb Noun Adjective Adverb Article hellip We can also include inflection

Verbs Tense number hellip Nouns Number propercommon hellip Adjectives comparative superlative hellip hellip

539

Parts of Speech 8 (ish) traditional parts of speech

Noun verb adjective preposition adverb article interjection pronoun conjunction etc

Called parts-of-speech lexical categories word classes morphological classes lexical tags

Lots of debate within linguistics about the number nature and universality of these

Wersquoll completely ignore this debate

639

7 Traditional POS Categories N noun chair bandwidth

pacing V verb study debate munch ADJ adj purple tall ridiculous ADV adverb unfortunately slowly P preposition of by to PRO pronoun I me mine DET determiner the a that those

739

POS Tagging

The process of assigning a part-of-speech or lexical class marker to each word in a collection WORD tag

the DETkoala Nput Vthe DETkeys Non Pthe DETtable N

839

Penn TreeBank POS Tag Set Penn Treebank hand-annotated corpus of

Wall Street Journal 1M words 46 tags Some particularities

to TO not disambiguated Auxiliaries and verbs not distinguished

939

Penn Treebank Tagset

1039

Why POS tagging is useful Speech synthesis

How to pronounce ldquoleadrdquo INsult inSULT OBject obJECT OVERflow overFLOW DIScount disCOUNT CONtent conTENT

Stemming for information retrieval Can search for ldquoaardvarksrdquo get ldquoaardvarkrdquo

Parsing and speech recognition and etc Possessive pronouns (my your her) followed by nouns Personal pronouns (I you he) likely to be followed by verbs Need to know if a word is an N or V before you can parse

Information extraction Finding names relations etc

Machine Translation

1139

Equivalent Problem in Bioinformatics

Durbin et al Biological Sequence Analysis Cambridge University Press

Several applications eg proteins

From primary structure ATCPLELLLD

Infer secondary structure HHHBBBBBC

1239

Why is POS Tagging Useful First step of a vast number of practical tasks Speech synthesis

How to pronounce ldquoleadrdquo INsult inSULT OBject obJECT OVERflow overFLOW DIScount disCOUNT CONtent conTENT

Parsing Need to know if a word is an N or V before you can parse

Information extraction Finding names relations etc

Machine Translation

1339

Open and Closed Classes Closed class a small fixed membership

Prepositions of in by hellip Auxiliaries may can will had been hellip Pronouns I you she mine his them hellip Usually function words (short common words which

play a role in grammar) Open class new ones can be created all the time

English has 4 Nouns Verbs Adjectives Adverbs Many languages have these 4 but not all

1439

Open Class Words Nouns

Proper nouns (Boulder Granby Eli Manning) English capitalizes these

Common nouns (the rest) Count nouns and mass nouns

Count have plurals get counted goatgoats one goat two goats Mass donrsquot get counted (snow salt communism) (two snows)

Adverbs tend to modify things Unfortunately John walked home extremely slowly yesterday Directionallocative adverbs (herehome downhill) Degree adverbs (extremely very somewhat) Manner adverbs (slowly slinkily delicately)

Verbs In English have morphological affixes (eateatseaten)

1539

Closed Class Words

Examples prepositions on under over hellip particles up down on off hellip determiners a an the hellip pronouns she who I conjunctions and but or hellip auxiliary verbs can may should hellip numerals one two three third hellip

1639

Prepositions from CELEX

1739

English Particles

1839

Conjunctions

1939

POS TaggingChoosing a Tagset

There are so many parts of speech potential distinctions we can draw

To do POS tagging we need to choose a standard set of tags to work with

Could pick very coarse tagsets N V Adj Adv

More commonly used set is finer grained the ldquoPenn TreeBank tagsetrdquo 45 tags PRP$ WRB WP$ VBG

Even more fine-grained tagsets exist

2039

Using the Penn Tagset

TheDT grandJJ juryNN commmentedVBD onIN aDT numberNN ofIN otherJJ topicsNNS

Prepositions and subordinating conjunctions marked IN (ldquoalthoughIN IPRPrdquo)

Except the prepositioncomplementizer ldquotordquo is just marked ldquoTOrdquo

2139

POS Tagging

Words often have more than one POS back The back door = JJ On my back = NN Win the voters back = RB Promised to back the bill = VB

The POS tagging problem is to determine the POS tag for a particular instance of a word

These examples from Dekang Lin

2239

How Hard is POS Tagging Measuring Ambiguity

2339

Current Performance How many tags are correct

About 97 currently But baseline is already 90 Baseline algorithm

Tag every word with its most frequent tag Tag unknown words as nouns

How well do people do

2439

Quick Test Agreement the students went to class plays well with others fruit flies like a banana

DT the this thatNN nounVB verbP prepostionADV adverb

2539

Quick Test the students went to class DT NN VB P NN plays well with others VB ADV P NN NN NN P DT fruit flies like a banana NN NN VB DT NN NN VB P DT NN NN NN P DT NN NN VB VB DT NN

2639

How to do it History

1960 1970 1980 1990 2000

Brown Corpus Created (EN-US)1 Million Words

Brown Corpus Tagged

HMM Tagging (CLAWS)93-95

Greene and RubinRule Based - 70

LOB Corpus Created (EN-UK)1 Million Words

DeRoseChurchEfficient HMMSparse Data

95+

British National Corpus

(tagged by CLAWS)

POS Tagging separated from

other NLP

Transformation Based Tagging

(Eric Brill)Rule Based ndash 95+

Tree-Based Statistics (Helmut Shmid)

Rule Based ndash 96+

Neural Network 96+

Trigram Tagger(Kempe)

96+

Combined Methods98+

Penn Treebank Corpus

(WSJ 45M)

LOB Corpus Tagged

2739

Two Methods for POS Tagging

1 Rule-based tagging (ENGTWOL)

2 Stochastic1 Probabilistic sequence models

HMM (Hidden Markov Model) tagging MEMMs (Maximum Entropy Markov Models)

2839

Rule-Based Tagging

Start with a dictionary Assign all possible tags to words from the

dictionary Write rules by hand to selectively remove

tags Leaving the correct tag for each word

2939

Rule-based taggers Early POS taggers all hand-coded Most of these (Harris 1962 Greene and Rubin

1971) and the best of the recent ones ENGTWOL (Voutilainen 1995) based on a two-stage architecture Stage 1 look up word in lexicon to give list of potential

POSs Stage 2 Apply rules which certify or disallow tag

sequences Rules originally handwritten more recently Machine

Learning methods can be used

3039

Start With a Dictionarybull she PRPbull promised VBNVBDbull to TObull back VB JJ RB NNbull the DTbull bill NN VB

bull Etchellip for the ~100000 words of English with more than 1 tag

3139

Assign Every Possible Tag

NNRB

VBN JJ VBPRP VBD TO VB DT NNShe promised to back the bill

3239

Write Rules to Eliminate Tags

Eliminate VBN if VBD is an option when VBN|VBD follows ldquoltstartgt PRPrdquo

NN RB JJ VB

PRP VBD TO VB DT NNShe promised to back the bill

VBN

3339

Stage 1 of ENGTWOL Tagging First Stage Run words through FST morphological

analyzer to get all parts of speech Example Pavlov had shown that salivation hellip

Pavlov PAVLOV N NOM SG PROPERhad HAVE V PAST VFIN SVO

HAVE PCP2 SVOshown SHOW PCP2 SVOO SVO SVthat ADV

PRON DEM SGDET CENTRAL DEM SGCS

salivation N NOM SG

3439

Stage 2 of ENGTWOL Tagging Second Stage Apply NEGATIVE constraints Example Adverbial ldquothatrdquo rule

Eliminates all readings of ldquothatrdquo except the one in ldquoIt isnrsquot that oddrdquo

Given input ldquothatrdquoIf(+1 AADVQUANT) if next word is adjadvquantifier(+2 SENT-LIM) following which is E-O-S(NOT -1 SVOCA) and the previous word is not a verb like ldquoconsiderrdquo which

allows adjective complements in ldquoI consider that oddrdquoThen eliminate non-ADV tagsElse eliminate ADV

3539

Inline Mark-up POS Tagging

httpnlpcsqccunyeduwsj_poszip Input Format

Pierre Vinken 61CD yearsNNS old will join the board as a nonexecutive director Nov 29

Output FormatPierreNNP VinkenNNP 61CD yearsNNS

oldJJ willMD joinVB theDT boardNN asIN aDT nonexecutiveJJ directorNN NovNNP 29CD

3639

POS Tagging Tools NYU Prof Ralph Grishmanrsquos HMM POS tagger

(in Java)httpnlpcsqccunyedujetzip httpnlpcsqccunyedujet_srczip httpwwwcsnyueducsfacultygrishmanjetlicensehtml

Demo How it works

Learned HMM datapos_hmmtxtSource code srcjetHMMHMMTaggerjava

3739

POS Tagging Tools Stanford tagger (Loglinear tagger )

httpnlpstanfordedusoftwaretaggershtml Brill tagger httpwwwtechplymacuksocstaffguidbugmsoftwareRULE_BASED

_TAGGER_V114tarZ tagger LEXICON test BIGRAMS LEXICALRULEFULE

CONTEXTUALRULEFILE YamCha (SVM) httpchasenorg~takusoftwareyamcha MXPOST (Maximum Entropy) ftpftpcisupennedupubadwaitjmx More complete list at httpwww-nlpstanfordedulinksstatnlphtmlTaggers

3839

NLP Toolkits Uniform CL Annotation Platform

UIMA (IBM NLP platform) httpincubatorapacheorguimasvnhtml Mallet (UMASS) httpmalletcsumasseduindexphpMain_Page MinorThird (CMU) httpminorthirdsourceforgenet NLTK httpnltksourceforgenet

Natural langauge toolkit with data sets Demo

Information Extraction Jet (NYU IE toolkit)

httpwwwcsnyueducsfacultygrishmanjetlicensehtml Gate httpgateacukdownloadindexhtml

University of Sheffield IE toolkit Information Retrieval

INDRI httpwwwlemurprojectorgindri Information Retrieval toolkit

Machine Translation Compara httpadamastorlinguatecaptCOMPARAWelcomehtml ISI decoder httpwwwisiedulicensed-swrewrite-decoder MOSES httpwwwstatmtorgmoses

Looking Ahead: Next Class

Machine Learning for POS Tagging
  • Hidden Markov Model


What is Part-of-Speech (POS)?

Generally speaking, Word Classes (= POS):
  • Verb, Noun, Adjective, Adverb, Article, …

We can also include inflection:
  • Verbs: tense, number, …
  • Nouns: number, proper/common, …
  • Adjectives: comparative, superlative, …
  • …

Parts of Speech

  • 8 (ish) traditional parts of speech
      Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
  • Called: parts-of-speech, lexical categories, word classes, morphological classes, lexical tags
  • Lots of debate within linguistics about the number, nature, and universality of these
  • We’ll completely ignore this debate

7 Traditional POS Categories

  • N      noun          chair, bandwidth, pacing
  • V      verb          study, debate, munch
  • ADJ    adj           purple, tall, ridiculous
  • ADV    adverb        unfortunately, slowly
  • P      preposition   of, by, to
  • PRO    pronoun       I, me, mine
  • DET    determiner    the, a, that, those

POS Tagging

The process of assigning a part-of-speech or lexical class marker to each word in a collection.

WORD     tag
the      DET
koala    N
put      V
the      DET
keys     N
on       P
the      DET
table    N

Penn TreeBank POS Tag Set

  • Penn Treebank: hand-annotated corpus of Wall Street Journal, 1M words
  • 46 tags
  • Some particularities:
      to/TO not disambiguated
      Auxiliaries and verbs not distinguished

Penn Treebank Tagset

Why POS tagging is useful

  • Speech synthesis: how to pronounce “lead”?
      INsult / inSULT, OBject / obJECT, OVERflow / overFLOW, DIScount / disCOUNT, CONtent / conTENT
  • Stemming for information retrieval
      Can search for “aardvarks”, get “aardvark”
  • Parsing, speech recognition, etc.
      Possessive pronouns (my, your, her) followed by nouns
      Personal pronouns (I, you, he) likely to be followed by verbs
      Need to know if a word is an N or V before you can parse
  • Information extraction: finding names, relations, etc.
  • Machine Translation

Equivalent Problem in Bioinformatics

  • Durbin et al., Biological Sequence Analysis, Cambridge University Press
  • Several applications, e.g. proteins
      From primary structure: ATCPLELLLD
      Infer secondary structure: HHHBBBBBC

Why is POS Tagging Useful?

  • First step of a vast number of practical tasks
  • Speech synthesis: how to pronounce “lead”?
      INsult / inSULT, OBject / obJECT, OVERflow / overFLOW, DIScount / disCOUNT, CONtent / conTENT
  • Parsing: need to know if a word is an N or V before you can parse
  • Information extraction: finding names, relations, etc.
  • Machine Translation

Open and Closed Classes

  • Closed class: a small, fixed membership
      Prepositions: of, in, by, …
      Auxiliaries: may, can, will, had, been, …
      Pronouns: I, you, she, mine, his, them, …
      Usually function words (short common words which play a role in grammar)
  • Open class: new ones can be created all the time
      English has 4: Nouns, Verbs, Adjectives, Adverbs
      Many languages have these 4, but not all

Open Class Words

  • Nouns
      Proper nouns (Boulder, Granby, Eli Manning); English capitalizes these
      Common nouns (the rest)
      Count nouns and mass nouns
          Count: have plurals, get counted: goat/goats, one goat, two goats
          Mass: don’t get counted (snow, salt, communism) (*two snows)
  • Adverbs: tend to modify things
      Unfortunately, John walked home extremely slowly yesterday
      Directional/locative adverbs (here, home, downhill)
      Degree adverbs (extremely, very, somewhat)
      Manner adverbs (slowly, slinkily, delicately)
  • Verbs
      In English, have morphological affixes (eat/eats/eaten)

Closed Class Words

Examples:
  • prepositions: on, under, over, …
  • particles: up, down, on, off, …
  • determiners: a, an, the, …
  • pronouns: she, who, I, …
  • conjunctions: and, but, or, …
  • auxiliary verbs: can, may, should, …
  • numerals: one, two, three, third, …

Prepositions from CELEX

English Particles

Conjunctions

POS Tagging: Choosing a Tagset

  • There are so many parts of speech, potential distinctions we can draw
  • To do POS tagging, we need to choose a standard set of tags to work with
  • Could pick very coarse tagsets: N, V, Adj, Adv
  • More commonly used set is finer grained: the “Penn TreeBank tagset”, 45 tags
      PRP$, WRB, WP$, VBG
  • Even more fine-grained tagsets exist
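To make the coarse versus fine-grained distinction concrete, here is a small Python sketch that collapses Penn Treebank tags onto the coarse N / V / Adj / Adv set. It assumes the Penn convention that noun tags start with NN, verb tags with VB, adjective tags with JJ, and adverb tags with RB; the function name coarse_tag and the catch-all label Other are illustrative choices, not part of either tagset.

```python
def coarse_tag(penn_tag):
    """Map a Penn Treebank tag to a coarse open-class label by its prefix."""
    if penn_tag.startswith("NN"):   # NN, NNS, NNP, NNPS
        return "N"
    if penn_tag.startswith("VB"):   # VB, VBD, VBG, VBN, VBP, VBZ
        return "V"
    if penn_tag.startswith("JJ"):   # JJ, JJR, JJS
        return "Adj"
    if penn_tag.startswith("RB"):   # RB, RBR, RBS
        return "Adv"
    # Closed-class tags (DT, IN, PRP$, WRB, ...) have no coarse counterpart here.
    return "Other"


coarse = [coarse_tag(t) for t in ["NNS", "VBG", "JJ", "RB", "PRP$", "WRB"]]
```

The mapping only goes one way: a coarse tag such as V cannot be expanded back into VBD or VBG without looking at the word itself.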

Using the Penn Tagset

  • The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS
  • Prepositions and subordinating conjunctions marked IN (“although/IN I/PRP…”)
  • Except the preposition/complementizer “to” is just marked “TO”

POS Tagging

  • Words often have more than one POS: back
      The back door = JJ
      On my back = NN
      Win the voters back = RB
      Promised to back the bill = VB
  • The POS tagging problem is to determine the POS tag for a particular instance of a word

These examples from Dekang Lin

How Hard is POS Tagging? Measuring Ambiguity

Current Performance

  • How many tags are correct?
      About 97% currently
      But baseline is already 90%
  • Baseline algorithm:
      Tag every word with its most frequent tag
      Tag unknown words as nouns
  • How well do people do?
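The baseline algorithm (most frequent tag per word, unknown words tagged as nouns) can be sketched in a few lines of Python. This is an illustration under stated assumptions: the toy corpus, the function names train_baseline and tag_baseline, and the choice of NN as the noun tag are all made up for the example, and the accuracy figures on the slide come from real corpora, not from this toy data.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Record, for every word seen in training, its single most frequent tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}


def tag_baseline(words, most_frequent, unknown_tag="NN"):
    """Tag known words with their most frequent tag and unknown words as nouns."""
    return [(w, most_frequent.get(w.lower(), unknown_tag)) for w in words]


# Hypothetical four-sentence training set; "back" appears as JJ, NN, VB, NN.
toy_corpus = [
    [("the", "DT"), ("back", "JJ"), ("door", "NN")],
    [("on", "IN"), ("my", "PRP$"), ("back", "NN")],
    [("to", "TO"), ("back", "VB"), ("the", "DT"), ("bill", "NN")],
    [("his", "PRP$"), ("back", "NN"), ("hurts", "VBZ")],
]
model = train_baseline(toy_corpus)
result = tag_baseline(["the", "koala", "back"], model)
```

Because the baseline ignores context, it always gives "back" the same tag, which is exactly the kind of error that context-sensitive taggers are meant to remove.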

Quick Test: Agreement?

  • the students went to class
  • plays well with others
  • fruit flies like a banana

DT: the, this, that
NN: noun
VB: verb
P: preposition
ADV: adverb

Quick Test

the students went to class
      DT NN VB P NN

plays well with others
      VB ADV P NN
      NN NN P DT

fruit flies like a banana
      NN NN VB DT NN
      NN VB P DT NN
      NN NN P DT NN
      NN VB VB DT NN

How to do it? History

[Timeline figure with a 1960 to 2000 axis; the recoverable items are listed below in extraction order, which is not necessarily chronological]
  • Brown Corpus created (EN-US), 1 million words
  • Brown Corpus tagged
  • HMM tagging (CLAWS): 93%-95%
  • Greene and Rubin, rule based: 70%
  • LOB Corpus created (EN-UK), 1 million words
  • DeRose/Church, efficient HMM, sparse data: 95%+
  • British National Corpus (tagged by CLAWS)
  • POS tagging separated from other NLP
  • Transformation-based tagging (Eric Brill), rule based: 95%+
  • Tree-based statistics (Helmut Schmid), rule based: 96%+
  • Neural network: 96%+
  • Trigram tagger (Kempe): 96%+
  • Combined methods: 98%+
  • Penn Treebank Corpus (WSJ, 4.5M)
  • LOB Corpus tagged

Two Methods for POS Tagging

1. Rule-based tagging (ENGTWOL)
2. Stochastic
      1. Probabilistic sequence models
          HMM (Hidden Markov Model) tagging
          MEMMs (Maximum Entropy Markov Models)

Rule-Based Tagging

  • Start with a dictionary
  • Assign all possible tags to words from the dictionary
  • Write rules by hand to selectively remove tags
  • Leaving the correct tag for each word

Rule-based taggers

  • Early POS taggers all hand-coded
  • Most of these (Harris, 1962; Greene and Rubin, 1971) and the best of the recent ones, ENGTWOL (Voutilainen, 1995), based on a two-stage architecture
      Stage 1: look up word in lexicon to give list of potential POSs
      Stage 2: apply rules which certify or disallow tag sequences
  • Rules originally handwritten; more recently Machine Learning methods can be used

Start With a Dictionary

  • she: PRP
  • promised: VBN, VBD
  • to: TO
  • back: VB, JJ, RB, NN
  • the: DT
  • bill: NN, VB
  • Etc… for the ~100,000 words of English with more than 1 tag

Assign Every Possible Tag

                    NN
                    RB
      VBN           JJ        VB
PRP   VBD       TO  VB    DT  NN
She   promised  to  back  the bill

Write Rules to Eliminate Tags

Eliminate VBN if VBD is an option when VBN|VBD follows “<start> PRP”

                    NN
                    RB
                    JJ        VB
PRP   VBD       TO  VB    DT  NN
She   promised  to  back  the bill

VBN (eliminated for “promised”)
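The three steps of the rule-based approach (dictionary lookup, assigning every possible tag, eliminating tags by rule) can be put together in a short Python sketch. It uses only the six dictionary entries and the single VBN/VBD rule from the slides; the names LEXICON, assign_all_tags and eliminate_vbn_after_initial_prp are illustrative, and a real tagger such as ENGTWOL uses a far larger lexicon and rule set.

```python
# Toy dictionary: the six entries shown on the "Start With a Dictionary" slide.
LEXICON = {
    "she": ["PRP"],
    "promised": ["VBN", "VBD"],
    "to": ["TO"],
    "back": ["VB", "JJ", "RB", "NN"],
    "the": ["DT"],
    "bill": ["NN", "VB"],
}


def assign_all_tags(words):
    """Steps 1 and 2: look each word up and list every tag the lexicon allows.

    Words missing from this toy lexicon raise KeyError.
    """
    return [list(LEXICON[w.lower()]) for w in words]


def eliminate_vbn_after_initial_prp(candidates):
    """Step 3: eliminate VBN if VBD is an option when VBN|VBD follows "<start> PRP"."""
    if len(candidates) >= 2 and candidates[0] == ["PRP"]:
        second = candidates[1]
        if "VBN" in second and "VBD" in second:
            second.remove("VBN")
    return candidates


words = "She promised to back the bill".split()
before = assign_all_tags(words)
after = eliminate_vbn_after_initial_prp(assign_all_tags(words))
```

After this one rule, "promised" is resolved to VBD, while "back" and "bill" still carry several candidate tags; further rules would be needed to finish the sentence.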

Stage 1 of ENGTWOL Tagging

First Stage: Run words through FST morphological analyzer to get all parts of speech.
Example: Pavlov had shown that salivation …

Pavlov      PAVLOV N NOM SG PROPER
had         HAVE V PAST VFIN SVO
            HAVE PCP2 SVO
shown       SHOW PCP2 SVOO SVO SV
that        ADV
            PRON DEM SG
            DET CENTRAL DEM SG
            CS
salivation  N NOM SG

Stage 2 of ENGTWOL Tagging

Second Stage: Apply NEGATIVE constraints.
Example: Adverbial “that” rule
      Eliminates all readings of “that” except the one in “It isn’t that odd”

Given input: “that”
If
      (+1 A/ADV/QUANT)    if next word is adj/adv/quantifier
      (+2 SENT-LIM)       following which is E-O-S
      (NOT -1 SVOC/A)     and the previous word is not a verb like “consider”, which allows adjective complements in “I consider that odd”
Then eliminate non-ADV tags
Else eliminate ADV
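The adverbial “that” constraint can be written as a small function over candidate tag sets. The Python sketch below is a simplification under stated assumptions: the tag labels are reduced to short strings, the SVOC/A verb class is represented by a hypothetical one-word list containing only “consider”, and running past the last token is treated as the sentence limit. It is not ENGTWOL's actual rule formalism.

```python
ADJ_ADV_QUANT = {"A", "ADV", "QUANT"}   # simplified stand-ins for A/ADV/QUANT
SVOCA_VERBS = {"consider"}              # hypothetical one-word SVOC/A verb list


def adverbial_that(tokens, i):
    """Apply the adverbial-"that" constraint.

    tokens: list of (word, set of candidate tags); i: index of "that".
    Returns the set of tags for "that" that survive the rule.
    """
    _, tags = tokens[i]
    # (+1 A/ADV/QUANT): the next word can be an adjective, adverb or quantifier.
    next_ok = i + 1 < len(tokens) and bool(tokens[i + 1][1] & ADJ_ADV_QUANT)
    # (+2 SENT-LIM): the position after that is the end of the sentence.
    at_limit = i + 2 >= len(tokens) or "SENT-LIM" in tokens[i + 2][1]
    # (NOT -1 SVOC/A): the previous word is not a verb like "consider".
    prev_ok = i == 0 or tokens[i - 1][0].lower() not in SVOCA_VERBS
    if next_ok and at_limit and prev_ok:
        return tags & {"ADV"}    # then eliminate non-ADV tags
    return tags - {"ADV"}        # else eliminate ADV


THAT = {"ADV", "PRON", "DET", "CS"}
odd = [("It", {"PRON"}), ("isn't", {"V"}), ("that", THAT), ("odd", {"A"})]
consider = [("I", {"PRON"}), ("consider", {"V"}), ("that", THAT), ("odd", {"A"})]
pavlov = [("shown", {"V"}), ("that", THAT), ("salivation", {"N"})]

odd_result = adverbial_that(odd, 2)
consider_result = adverbial_that(consider, 2)
pavlov_result = adverbial_that(pavlov, 1)
```

Only the first sentence keeps the ADV reading; in the other two the rule removes ADV and leaves the remaining readings for later constraints to resolve.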

Page 7: POS Tagging: Introduction

7/39

POS Tagging

The process of assigning a part-of-speech or lexical class marker to each word in a collection.

WORD    tag
the     DET
koala   N
put     V
the     DET
keys    N
on      P
the     DET
table   N

8/39

Penn TreeBank POS Tag Set

Penn Treebank: hand-annotated corpus of Wall Street Journal, 1M words
46 tags
Some particularities:
  to/TO not disambiguated
  Auxiliaries and verbs not distinguished

9/39

Penn Treebank Tagset

10/39

Why POS tagging is useful

Speech synthesis
  How to pronounce "lead"?
  INsult / inSULT, OBject / obJECT, OVERflow / overFLOW, DIScount / disCOUNT, CONtent / conTENT
Stemming for information retrieval
  Can search for "aardvarks", get "aardvark"
Parsing, speech recognition, etc.
  Possessive pronouns (my, your, her) are followed by nouns
  Personal pronouns (I, you, he) are likely to be followed by verbs
  Need to know if a word is an N or V before you can parse
Information extraction
  Finding names, relations, etc.
Machine Translation

11/39

Equivalent Problem in Bioinformatics

Durbin et al., Biological Sequence Analysis, Cambridge University Press
Several applications, e.g. proteins
  From primary structure: ATCPLELLLD
  Infer secondary structure: HHHBBBBBC

12/39

Why is POS Tagging Useful?

First step of a vast number of practical tasks
Speech synthesis
  How to pronounce "lead"?
  INsult / inSULT, OBject / obJECT, OVERflow / overFLOW, DIScount / disCOUNT, CONtent / conTENT
Parsing
  Need to know if a word is an N or V before you can parse
Information extraction
  Finding names, relations, etc.
Machine Translation

13/39

Open and Closed Classes

Closed class: a small, fixed membership
  Prepositions: of, in, by, …
  Auxiliaries: may, can, will, had, been, …
  Pronouns: I, you, she, mine, his, them, …
  Usually function words (short common words which play a role in grammar)
Open class: new ones can be created all the time
  English has 4: Nouns, Verbs, Adjectives, Adverbs
  Many languages have these 4, but not all

14/39

Open Class Words

Nouns
  Proper nouns (Boulder, Granby, Eli Manning)
    English capitalizes these
  Common nouns (the rest)
  Count nouns and mass nouns
    Count: have plurals, get counted: goat/goats, one goat, two goats
    Mass: don't get counted (snow, salt, communism) (*two snows)
Adverbs: tend to modify things
  Unfortunately, John walked home extremely slowly yesterday
  Directional/locative adverbs (here, home, downhill)
  Degree adverbs (extremely, very, somewhat)
  Manner adverbs (slowly, slinkily, delicately)
Verbs
  In English, have morphological affixes (eat/eats/eaten)

15/39

Closed Class Words

Examples:
  prepositions: on, under, over, …
  particles: up, down, on, off, …
  determiners: a, an, the, …
  pronouns: she, who, I
  conjunctions: and, but, or, …
  auxiliary verbs: can, may, should, …
  numerals: one, two, three, third, …

16/39

Prepositions from CELEX

17/39

English Particles

18/39

Conjunctions

19/39

POS Tagging: Choosing a Tagset

There are so many parts of speech, potential distinctions we can draw
To do POS tagging, we need to choose a standard set of tags to work with
Could pick very coarse tagsets
  N, V, Adj, Adv
More commonly used set is finer grained: the "Penn TreeBank tagset", 45 tags
  PRP$, WRB, WP$, VBG
Even more fine-grained tagsets exist

20/39

Using the Penn Tagset

The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS

Prepositions and subordinating conjunctions are marked IN ("although/IN I/PRP")
Except the preposition/complementizer "to", which is just marked "TO"

21/39

POS Tagging

Words often have more than one POS: back
  The back door = JJ
  On my back = NN
  Win the voters back = RB
  Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word

These examples from Dekang Lin

22/39

How Hard is POS Tagging? Measuring Ambiguity

23/39

Current Performance

How many tags are correct?
  About 97% currently
  But baseline is already 90%
  Baseline algorithm:
    Tag every word with its most frequent tag
    Tag unknown words as nouns
How well do people do?
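
The baseline algorithm can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy training sentences are invented, the helper names (`train_baseline`, `tag_baseline`, `accuracy`) are hypothetical, and lowercasing words before lookup is a design choice the slide does not mention.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_baseline(words, most_frequent, unknown_tag="NN"):
    """Tag known words with their most frequent tag, unknown words as nouns."""
    return [(w, most_frequent.get(w.lower(), unknown_tag)) for w in words]

def accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = sum(p == g for (_, p), (_, g) in zip(predicted, gold))
    return correct / len(gold)

# Toy training data, invented for illustration only.
train = [
    [("the", "DT"), ("back", "JJ"), ("door", "NN")],
    [("on", "IN"), ("my", "PRP$"), ("back", "NN")],
    [("the", "DT"), ("back", "NN"), ("hurts", "VBZ")],
]
model = train_baseline(train)

gold = [("the", "DT"), ("koala", "NN"), ("back", "NN")]
pred = tag_baseline([w for w, _ in gold], model)
print(pred)
print(accuracy(pred, gold))
```

Here "koala" never appears in the training data, so it falls back to the noun tag; "back" gets NN because that is its most frequent tag in the toy corpus.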

24/39

Quick Test: Agreement?

the students went to class
plays well with others
fruit flies like a banana

DT: the, this, that
NN: noun
VB: verb
P: preposition
ADV: adverb

25/39

Quick Test

the students went to class
  DT NN VB P NN
plays well with others
  VB ADV P NN
  NN NN P DT
fruit flies like a banana
  NN NN VB DT NN
  NN VB P DT NN
  NN NN P DT NN
  NN VB VB DT NN

26/39

How to do it? History

(Timeline figure, 1960 to 2000; items and accuracy figures as labeled on the slide)
  Brown Corpus created (EN-US), 1 million words
  Brown Corpus tagged
  Greene and Rubin, rule based: 70%
  LOB Corpus created (EN-UK), 1 million words
  LOB Corpus tagged
  HMM tagging (CLAWS): 93%-95%
  DeRose/Church, efficient HMM, sparse data: 95%+
  British National Corpus (tagged by CLAWS)
  POS tagging separated from other NLP
  Transformation-based tagging (Eric Brill), rule based: 95%+
  Tree-based statistics (Helmut Schmid), rule based: 96%+
  Neural network: 96%+
  Trigram tagger (Kempe): 96%+
  Combined methods: 98%+
  Penn Treebank Corpus (WSJ, 4.5M)

27/39

Two Methods for POS Tagging

1. Rule-based tagging (ENGTWOL)
2. Stochastic
   1. Probabilistic sequence models
      HMM (Hidden Markov Model) tagging
      MEMMs (Maximum Entropy Markov Models)

28/39

Rule-Based Tagging

Start with a dictionary
Assign all possible tags to words from the dictionary
Write rules by hand to selectively remove tags
Leaving the correct tag for each word

29/39

Rule-based taggers

Early POS taggers were all hand-coded
Most of these (Harris, 1962; Greene and Rubin, 1971) and the best of the recent ones, ENGTWOL (Voutilainen, 1995), are based on a two-stage architecture
  Stage 1: look up word in lexicon to give list of potential POSs
  Stage 2: apply rules which certify or disallow tag sequences
Rules were originally handwritten; more recently, Machine Learning methods can be used

30/39

Start With a Dictionary

• she: PRP
• promised: VBN, VBD
• to: TO
• back: VB, JJ, RB, NN
• the: DT
• bill: NN, VB
• Etc… for the ~100,000 words of English with more than 1 tag

31/39

Assign Every Possible Tag

                   NN
                   RB
     VBN           JJ         VB
PRP  VBD       TO  VB    DT   NN
She  promised  to  back  the  bill

32/39

Write Rules to Eliminate Tags

Eliminate VBN if VBD is an option when VBN|VBD follows "<start> PRP"

                   NN
                   RB
                   JJ         VB
PRP  VBD       TO  VB    DT   NN
She  promised  to  back  the  bill

VBN (eliminated for "promised")
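
The three steps (dictionary, assign every possible tag, eliminate by rule) might look roughly like the following Python sketch. The lexicon holds only the six entries from the slide, and the function names are hypothetical; a real rule-based tagger would have a far larger lexicon and many more rules.

```python
# Lexicon entries copied from the "Start With a Dictionary" slide.
LEXICON = {
    "she": ["PRP"],
    "promised": ["VBN", "VBD"],
    "to": ["TO"],
    "back": ["VB", "JJ", "RB", "NN"],
    "the": ["DT"],
    "bill": ["NN", "VB"],
}

def assign_all_tags(words, lexicon):
    """Stage 1: look up every word and list all of its possible tags."""
    return [list(lexicon[w.lower()]) for w in words]

def eliminate_vbn_after_start_prp(candidates):
    """Stage 2 (one rule): eliminate VBN if VBD is an option
    when VBN|VBD follows '<start> PRP'."""
    if len(candidates) >= 2 and candidates[0] == ["PRP"]:
        second = candidates[1]
        if "VBN" in second and "VBD" in second:
            candidates[1] = [t for t in second if t != "VBN"]
    return candidates

words = "She promised to back the bill".split()
lattice = assign_all_tags(words, LEXICON)
lattice = eliminate_vbn_after_start_prp(lattice)
for word, tags in zip(words, lattice):
    print(word, tags)
```

After this single rule only "promised" has been disambiguated; "back" and "bill" still carry several candidate tags, which further hand-written rules would have to remove.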

33/39

Stage 1 of ENGTWOL Tagging

First Stage: Run words through FST morphological analyzer to get all parts of speech.
Example: Pavlov had shown that salivation …

Pavlov      PAVLOV N NOM SG PROPER
had         HAVE V PAST VFIN SVO
            HAVE PCP2 SVO
shown       SHOW PCP2 SVOO SVO SV
that        ADV
            PRON DEM SG
            DET CENTRAL DEM SG
            CS
salivation  N NOM SG

34/39

Stage 2 of ENGTWOL Tagging

Second Stage: Apply NEGATIVE constraints.
Example: Adverbial "that" rule
  Eliminates all readings of "that" except the one in "It isn't that odd"

Given input: "that"
If
  (+1 A/ADV/QUANT)    if next word is adj/adv/quantifier
  (+2 SENT-LIM)       following which is E-O-S
  (NOT -1 SVOC/A)     and the previous word is not a verb like "consider" which allows adjective complements in "I consider that odd"
Then eliminate non-ADV tags
Else eliminate ADV
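
As a rough illustration, the adverbial "that" constraint could be written in Python as below. This is a sketch, not ENGTWOL itself: the reading labels are simplified to single strings, the token representation and the names `apply_adverbial_that_rule` and `sentence` are hypothetical, and positions past the end of the token list are treated as a sentence limit.

```python
def apply_adverbial_that_rule(tokens, i):
    """Apply the adverbial-"that" constraint to tokens[i] in place.

    tokens is a list of dicts: {"word": str, "readings": set of labels}.
    """
    def readings(j):
        # Positions outside the sentence count as a sentence limit.
        if 0 <= j < len(tokens):
            return tokens[j]["readings"]
        return {"SENT-LIM"}

    next_is_adj_adv_quant = bool(readings(i + 1) & {"A", "ADV", "QUANT"})
    then_sentence_ends = "SENT-LIM" in readings(i + 2)
    prev_allows_adj_complement = "SVOC/A" in readings(i - 1)

    current = tokens[i]["readings"]
    if next_is_adj_adv_quant and then_sentence_ends and not prev_allows_adj_complement:
        tokens[i]["readings"] = current & {"ADV"}   # eliminate non-ADV tags
    else:
        tokens[i]["readings"] = current - {"ADV"}   # eliminate ADV
    return tokens

def sentence(*pairs):
    """Build a token list from (word, readings) pairs."""
    return [{"word": w, "readings": set(r)} for w, r in pairs]

THAT = ("that", ["ADV", "PRON", "DET", "CS"])
odd = sentence(("It", ["PRON"]), ("isn't", ["V"]), THAT, ("odd", ["A"]))
consider = sentence(("I", ["PRON"]), ("consider", ["V", "SVOC/A"]), THAT, ("odd", ["A"]))

apply_adverbial_that_rule(odd, 2)
apply_adverbial_that_rule(consider, 2)
print(odd[2]["readings"])
print(consider[2]["readings"])
```

In "It isn't that odd" only the ADV reading of "that" survives; in "I consider that odd" the ADV reading is removed and the other three readings remain.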

35/39

Inline Mark-up

POS Tagging
  http://nlp.cs.qc.cuny.edu/wsj_pos.zip

Input Format
  Pierre Vinken 61/CD years/NNS old will join the board as a nonexecutive director Nov 29

Output Format
  Pierre/NNP Vinken/NNP 61/CD years/NNS old/JJ will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov/NNP 29/CD
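
A small Python sketch for reading and writing this kind of inline mark-up follows. It assumes the common word/TAG convention with a slash between word and tag (the separator is an assumption here), and the helper names `parse_tagged` and `format_tagged` are hypothetical.

```python
def parse_tagged(line):
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs.

    The tag is taken after the LAST slash, so a token such as '1/2/CD'
    becomes ('1/2', 'CD'). Tokens without a slash get tag None.
    """
    pairs = []
    for token in line.split():
        word, sep, tag = token.rpartition("/")
        if sep:
            pairs.append((word, tag))
        else:
            pairs.append((token, None))
    return pairs

def format_tagged(pairs):
    """Inverse of parse_tagged: join (word, tag) pairs back into one line."""
    return " ".join(w if t is None else f"{w}/{t}" for w, t in pairs)

line = "Pierre/NNP Vinken/NNP 61/CD years/NNS old/JJ"
pairs = parse_tagged(line)
print(pairs)
print(format_tagged(pairs) == line)
```

Splitting on the last slash rather than the first keeps words that themselves contain a slash intact.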

36/39

POS Tagging Tools

NYU Prof. Ralph Grishman's HMM POS tagger (in Java)
  http://nlp.cs.qc.cuny.edu/jet.zip
  http://nlp.cs.qc.cuny.edu/jet_src.zip
  http://www.cs.nyu.edu/cs/faculty/grishman/jet/license.html
Demo
How it works:
  Learned HMM: data/pos_hmm.txt
  Source code: src/jet/HMM/HMMTagger.java

37/39

POS Tagging Tools

Stanford tagger (Loglinear tagger)
  http://nlp.stanford.edu/software/tagger.shtml
Brill tagger
  http://www.tech.plym.ac.uk/soc/staff/guidbugm/software/RULE_BASED_TAGGER_V.1.14.tar.Z
  tagger LEXICON test BIGRAMS LEXICALRULEFILE CONTEXTUALRULEFILE
YamCha (SVM)
  http://chasen.org/~taku/software/yamcha/
MXPOST (Maximum Entropy)
  ftp://ftp.cis.upenn.edu/pub/adwait/jmx/
More complete list at: http://www-nlp.stanford.edu/links/statnlp.html#Taggers

38/39

NLP Toolkits

Uniform CL Annotation Platform
  UIMA (IBM NLP platform): http://incubator.apache.org/uima/svn.html
  Mallet (UMASS): http://mallet.cs.umass.edu/index.php/Main_Page
  MinorThird (CMU): http://minorthird.sourceforge.net/
  NLTK: http://nltk.sourceforge.net/
    Natural language toolkit, with data sets
    Demo
Information Extraction
  Jet (NYU IE toolkit): http://www.cs.nyu.edu/cs/faculty/grishman/jet/license.html
  Gate: http://gate.ac.uk/download/index.html
    University of Sheffield IE toolkit
Information Retrieval
  INDRI: http://www.lemurproject.org/indri/
    Information Retrieval toolkit
Machine Translation
  Compara: http://adamastor.linguateca.pt/COMPARA/Welcome.html
  ISI decoder: http://www.isi.edu/licensed-sw/rewrite-decoder/
  MOSES: http://www.statmt.org/moses/

39/39

Looking Ahead: Next Class

Machine Learning for POS Tagging
  Hidden Markov Model

Page 10: POS Tagging: Introduction

1039

Why POS tagging is useful Speech synthesis

How to pronounce ldquoleadrdquo INsult inSULT OBject obJECT OVERflow overFLOW DIScount disCOUNT CONtent conTENT

Stemming for information retrieval Can search for ldquoaardvarksrdquo get ldquoaardvarkrdquo

Parsing and speech recognition and etc Possessive pronouns (my your her) followed by nouns Personal pronouns (I you he) likely to be followed by verbs Need to know if a word is an N or V before you can parse

Information extraction Finding names relations etc

Machine Translation

1139

Equivalent Problem in Bioinformatics

Durbin et al Biological Sequence Analysis Cambridge University Press

Several applications eg proteins

From primary structure ATCPLELLLD

Infer secondary structure HHHBBBBBC

1239

Why is POS Tagging Useful First step of a vast number of practical tasks Speech synthesis

How to pronounce ldquoleadrdquo INsult inSULT OBject obJECT OVERflow overFLOW DIScount disCOUNT CONtent conTENT

Parsing Need to know if a word is an N or V before you can parse

Information extraction Finding names relations etc

Machine Translation

1339

Open and Closed Classes Closed class a small fixed membership

Prepositions of in by hellip Auxiliaries may can will had been hellip Pronouns I you she mine his them hellip Usually function words (short common words which

play a role in grammar) Open class new ones can be created all the time

English has 4 Nouns Verbs Adjectives Adverbs Many languages have these 4 but not all

1439

Open Class Words Nouns

Proper nouns (Boulder Granby Eli Manning) English capitalizes these

Common nouns (the rest) Count nouns and mass nouns

Count have plurals get counted goatgoats one goat two goats Mass donrsquot get counted (snow salt communism) (two snows)

Adverbs tend to modify things Unfortunately John walked home extremely slowly yesterday Directionallocative adverbs (herehome downhill) Degree adverbs (extremely very somewhat) Manner adverbs (slowly slinkily delicately)

Verbs In English have morphological affixes (eateatseaten)

1539

Closed Class Words

Examples prepositions on under over hellip particles up down on off hellip determiners a an the hellip pronouns she who I conjunctions and but or hellip auxiliary verbs can may should hellip numerals one two three third hellip

1639

Prepositions from CELEX

1739

English Particles

1839

Conjunctions

1939

POS TaggingChoosing a Tagset

There are so many parts of speech potential distinctions we can draw

To do POS tagging we need to choose a standard set of tags to work with

Could pick very coarse tagsets N V Adj Adv

More commonly used set is finer grained the ldquoPenn TreeBank tagsetrdquo 45 tags PRP$ WRB WP$ VBG

Even more fine-grained tagsets exist

2039

Using the Penn Tagset

TheDT grandJJ juryNN commmentedVBD onIN aDT numberNN ofIN otherJJ topicsNNS

Prepositions and subordinating conjunctions marked IN (ldquoalthoughIN IPRPrdquo)

Except the prepositioncomplementizer ldquotordquo is just marked ldquoTOrdquo

POS Tagging

  • Words often have more than one POS: back
      • The back door = JJ
      • On my back = NN
      • Win the voters back = RB
      • Promised to back the bill = VB
  • The POS tagging problem is to determine the POS tag for a particular instance of a word

  These examples are from Dekang Lin

How Hard is POS Tagging? Measuring Ambiguity
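
One common way to measure ambiguity is to count how many word types, and how many running tokens, appear with more than one tag in a tagged corpus. The sketch below assumes the corpus is a flat list of (word, tag) pairs and that words are compared case-insensitively; the function name `ambiguity_stats` and the toy data are invented for illustration.

```python
from collections import defaultdict


def ambiguity_stats(tagged_tokens):
    """Return (share of ambiguous word types, share of ambiguous tokens)."""
    tags_by_word = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_by_word[word.lower()].add(tag)

    # A word type is ambiguous if it was seen with more than one tag.
    ambiguous = {w for w, tags in tags_by_word.items() if len(tags) > 1}

    type_rate = len(ambiguous) / len(tags_by_word)
    token_hits = sum(word.lower() in ambiguous for word, _ in tagged_tokens)
    token_rate = token_hits / len(tagged_tokens)
    return type_rate, token_rate


# Toy corpus (invented): "back" is seen as both JJ and NN.
corpus = [
    ("the", "DT"), ("back", "JJ"), ("door", "NN"),
    ("on", "IN"), ("my", "PRP$"), ("back", "NN"),
]
type_rate, token_rate = ambiguity_stats(corpus)
print(f"ambiguous types: {type_rate:.2f}, ambiguous tokens: {token_rate:.2f}")
```

Ambiguous words tend to be frequent ones, so the token rate is usually higher than the type rate.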

Current Performance

  • How many tags are correct?
      • About 97% currently
      • But baseline is already 90%
      • Baseline algorithm:
          • Tag every word with its most frequent tag
          • Tag unknown words as nouns
  • How well do people do?
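
The baseline algorithm (most frequent tag per word, unknown words tagged as nouns) can be sketched in a few lines. The training data below is invented for illustration. Lowercasing words and using `NN` as the noun tag are assumptions of this sketch, as are the function names.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Return a dict mapping each (lowercased) word to its most frequent tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_baseline(words, most_frequent, unknown_tag="NN"):
    """Tag known words with their most frequent tag, unknown words as nouns."""
    return [(w, most_frequent.get(w.lower(), unknown_tag)) for w in words]


# Toy training data (invented): "back" occurs twice as NN and once as VB.
train = [
    [("the", "DT"), ("back", "NN"), ("door", "NN")],
    [("on", "IN"), ("my", "PRP$"), ("back", "NN")],
    [("to", "TO"), ("back", "VB"), ("the", "DT"), ("bill", "NN")],
]
model = train_baseline(train)
print(tag_baseline(["the", "back", "table"], model))
```

Here "back" is always tagged NN, even after "to", and the unseen word "table" falls back to NN. The baseline makes exactly this kind of error on ambiguous words.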

Quick Test: Agreement?

  • the students went to class
  • plays well with others
  • fruit flies like a banana

  Tags:
      DT: the, this, that
      NN: noun
      VB: verb
      P: preposition
      ADV: adverb

Quick Test

  the students went to class
      DT NN VB P NN
  plays well with others
      VB ADV P NN
      NN NN P DT
  fruit flies like a banana
      NN NN VB DT NN
      NN VB P DT NN
      NN NN P DT NN
      NN VB VB DT NN
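
Agreement between two annotators can be measured as the fraction of tokens that receive the same tag. The sketch below applies that to two of the candidate taggings of "fruit flies like a banana" listed above. The function name `agreement` is an assumption, and raw token-level agreement is only one of several possible measures.

```python
def agreement(tags_a, tags_b):
    """Fraction of positions where two tag sequences agree."""
    if len(tags_a) != len(tags_b):
        raise ValueError("annotations must cover the same tokens")
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)


# Two readings of "fruit flies like a banana" from the quick test.
reading_1 = ["NN", "NN", "VB", "DT", "NN"]  # fruit flies (insects) like ...
reading_2 = ["NN", "VB", "P", "DT", "NN"]   # fruit flies (moves) like ...
print(agreement(reading_1, reading_2))      # 3 of 5 tags match
```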

How to do it? History

  [Timeline figure spanning 1960, 1970, 1980, 1990, 2000; recoverable entries:]
  • Brown Corpus created (EN-US), 1 million words
  • Brown Corpus tagged
  • HMM tagging (CLAWS): 93%-95%
  • Greene and Rubin: rule based, 70%
  • LOB Corpus created (EN-UK), 1 million words
  • DeRose/Church: efficient HMM, sparse data, 95%+
  • British National Corpus (tagged by CLAWS)
  • POS tagging separated from other NLP
  • Transformation-based tagging (Eric Brill): rule based, 95%+
  • Tree-based statistics (Helmut Schmid): rule based, 96%+
  • Neural network: 96%+
  • Trigram tagger (Kempe): 96%+
  • Combined methods: 98%+
  • Penn Treebank Corpus (WSJ, 4.5M)
  • LOB Corpus tagged

Two Methods for POS Tagging

  1. Rule-based tagging (ENGTWOL)
  2. Stochastic
      1. Probabilistic sequence models
          • HMM (Hidden Markov Model) tagging
          • MEMMs (Maximum Entropy Markov Models)

Rule-Based Tagging

  • Start with a dictionary
  • Assign all possible tags to words from the dictionary
  • Write rules by hand to selectively remove tags
  • Leaving the correct tag for each word

Rule-based taggers

  • Early POS taggers were all hand-coded
  • Most of these (Harris, 1962; Greene and Rubin, 1971) and the best of the recent ones, ENGTWOL (Voutilainen, 1995), are based on a two-stage architecture
      • Stage 1: look up the word in a lexicon to give a list of potential POSs
      • Stage 2: apply rules which certify or disallow tag sequences
  • Rules were originally handwritten; more recently, machine learning methods can be used

Start With a Dictionary

  • she: PRP
  • promised: VBN, VBD
  • to: TO
  • back: VB, JJ, RB, NN
  • the: DT
  • bill: NN, VB
  • Etc… for the ~100,000 words of English with more than 1 tag

Assign Every Possible Tag

                        NN
                        RB
        VBN             JJ          VB
  PRP   VBD       TO    VB    DT    NN
  She   promised  to    back  the   bill

Write Rules to Eliminate Tags

  Eliminate VBN if VBD is an option when VBN|VBD follows "<start> PRP"

                        NN
                        RB
        VBN (removed)   JJ          VB
  PRP   VBD       TO    VB    DT    NN
  She   promised  to    back  the   bill
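
The dictionary lookup and the single elimination rule for "She promised to back the bill" can be sketched as follows. The lexicon entries come from the slides. The function names and the use of Python sets for the candidate tags are assumptions of this sketch, and it implements only the one rule shown.

```python
# Stage 1: the dictionary lists every possible tag for each word.
LEXICON = {
    "she": {"PRP"},
    "promised": {"VBN", "VBD"},
    "to": {"TO"},
    "back": {"VB", "JJ", "RB", "NN"},
    "the": {"DT"},
    "bill": {"NN", "VB"},
}


def assign_all_tags(words):
    """Return one fresh set of candidate tags per word."""
    return [set(LEXICON[w.lower()]) for w in words]


def eliminate_vbn_after_start_prp(candidates):
    """Eliminate VBN if VBD is an option when VBN|VBD follows "<start> PRP"."""
    sentence_starts_with_prp = len(candidates) >= 2 and candidates[0] == {"PRP"}
    if sentence_starts_with_prp and {"VBN", "VBD"} <= candidates[1]:
        candidates[1].discard("VBN")
    return candidates


words = "She promised to back the bill".split()
candidates = eliminate_vbn_after_start_prp(assign_all_tags(words))
for word, tags in zip(words, candidates):
    print(word, sorted(tags))
```

After this rule, "promised" keeps only VBD, while "back" and "bill" stay ambiguous until further rules are applied.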

Stage 1 of ENGTWOL Tagging

  • First stage: run words through an FST morphological analyzer to get all parts of speech
  • Example: Pavlov had shown that salivation …

      Pavlov      PAVLOV N NOM SG PROPER
      had         HAVE V PAST VFIN SVO
                  HAVE PCP2 SVO
      shown       SHOW PCP2 SVOO SVO SV
      that        ADV
                  PRON DEM SG
                  DET CENTRAL DEM SG
                  CS
      salivation  N NOM SG

Stage 2 of ENGTWOL Tagging

  • Second stage: apply NEGATIVE constraints
  • Example: adverbial "that" rule
      • Eliminates all readings of "that" except the one in "It isn't that odd"

      Given input: "that"
      If
          (+1 A/ADV/QUANT)   ; if next word is adj/adv/quantifier
          (+2 SENT-LIM)      ; following which is E-O-S
          (NOT -1 SVOC/A)    ; and the previous word is not a verb like "consider",
                             ; which allows adjective complements in "I consider that odd"
      Then eliminate non-ADV tags
      Else eliminate ADV
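
The adverbial "that" constraint can be sketched directly from its three conditions. This is a simplified illustration and not ENGTWOL's actual rule formalism. It assumes each token is a (word, readings) pair whose readings are a set of plain tag strings, and that positions past the end of the list count as sentence limits.

```python
def adverbial_that_rule(tokens, i):
    """Apply the adverbial-"that" constraint to tokens[i], editing its readings in place."""
    word, readings = tokens[i]
    if word.lower() != "that":
        return

    def readings_at(j):
        if j < 0:
            return set()
        if j >= len(tokens):
            return {"SENT-LIM"}  # assumption: end of list is a sentence limit
        return tokens[j][1]

    next_is_adj_adv_quant = bool(readings_at(i + 1) & {"A", "ADV", "QUANT"})
    then_sentence_limit = "SENT-LIM" in readings_at(i + 2)
    prev_allows_adj_complement = "SVOC/A" in readings_at(i - 1)

    if next_is_adj_adv_quant and then_sentence_limit and not prev_allows_adj_complement:
        readings.intersection_update({"ADV"})  # eliminate non-ADV tags
    else:
        readings.discard("ADV")                # eliminate ADV


THAT = {"ADV", "PRON", "DET", "CS"}

s1 = [("It", {"PRON"}), ("isn't", {"V"}), ("that", set(THAT)), ("odd", {"A"})]
adverbial_that_rule(s1, 2)
print(sorted(s1[2][1]))

s2 = [("I", {"PRON"}), ("consider", {"V", "SVOC/A"}), ("that", set(THAT)), ("odd", {"A"})]
adverbial_that_rule(s2, 2)
print(sorted(s2[2][1]))
```

In the first sentence only the ADV reading survives. In the second, the preceding verb "consider" blocks the rule, so ADV is the reading that gets removed.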

Inline Mark-up

  • POS tagging data: http://nlp.cs.qc.cuny.edu/wsj_pos.zip
  • Input format:
      Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
  • Output format:
      Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.
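
Reading and writing inline mark-up is a small string-processing task. The sketch below assumes each token is written as `word/TAG` with a slash separator, as in Penn Treebank-style tagged text. The separator is an assumption, exposed here as the `sep` parameter, and the function names are hypothetical.

```python
def parse_inline(line, sep="/"):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the LAST separator so words containing it stay intact.
        word, found, tag = token.rpartition(sep)
        if not found:
            raise ValueError(f"token without a tag: {token!r}")
        pairs.append((word, tag))
    return pairs


def to_inline(pairs, sep="/"):
    """Join (word, tag) pairs back into a single inline-tagged line."""
    return " ".join(f"{word}{sep}{tag}" for word, tag in pairs)


line = "Pierre/NNP Vinken/NNP 61/CD years/NNS old/JJ will/MD join/VB"
pairs = parse_inline(line)
print(pairs[:3])
print(to_inline(pairs) == line)
```

Splitting on the last separator keeps tokens such as a slash tagged as itself, or words that contain a slash, from being cut in the wrong place.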

POS Tagging Tools

  • NYU Prof. Ralph Grishman's HMM POS tagger (in Java)
      • http://nlp.cs.qc.cuny.edu/jet.zip
      • http://nlp.cs.qc.cuny.edu/jet_src.zip
      • http://www.cs.nyu.edu/cs/faculty/grishman/jet/license.html
  • Demo
  • How it works
      • Learned HMM: data/pos_hmm.txt
      • Source code: src/jet/HMM/HMMTagger.java

POS Tagging Tools

  • Stanford tagger (loglinear tagger): http://nlp.stanford.edu/software/tagger.shtml
  • Brill tagger: http://www.tech.plym.ac.uk/soc/staff/guidbugm/software/RULE_BASED_TAGGER_V.1.14.tar.Z
      • tagger LEXICON test BIGRAMS LEXICALRULEFILE CONTEXTUALRULEFILE
  • YamCha (SVM): http://chasen.org/~taku/software/yamcha/
  • MXPOST (Maximum Entropy): ftp://ftp.cis.upenn.edu/pub/adwait/jmx/
  • More complete list at: http://www-nlp.stanford.edu/links/statnlp.html#Taggers

NLP Toolkits

  • Uniform CL annotation platform
      • UIMA (IBM NLP platform): http://incubator.apache.org/uima/svn.html
      • Mallet (UMASS): http://mallet.cs.umass.edu/index.php/Main_Page
      • MinorThird (CMU): http://minorthird.sourceforge.net
      • NLTK: http://nltk.sourceforge.net
          • Natural language toolkit, with data sets
          • Demo
  • Information extraction
      • Jet (NYU IE toolkit): http://www.cs.nyu.edu/cs/faculty/grishman/jet/license.html
      • Gate: http://gate.ac.uk/download/index.html
          • University of Sheffield IE toolkit
  • Information retrieval
      • INDRI: http://www.lemurproject.org/indri/
          • Information retrieval toolkit
  • Machine translation
      • Compara: http://adamastor.linguateca.pt/COMPARA/Welcome.html
      • ISI decoder: http://www.isi.edu/licensed-sw/rewrite-decoder/
      • MOSES: http://www.statmt.org/moses

Looking Ahead: Next Class

  • Machine learning for POS tagging
  • Hidden Markov Model

Equivalent Problem in Bioinformatics

  • Durbin et al., Biological Sequence Analysis, Cambridge University Press
  • Several applications, e.g. proteins
  • From primary structure: ATCPLELLLD
  • Infer secondary structure: HHHBBBBBC

Why is POS Tagging Useful?

  • First step of a vast number of practical tasks
  • Speech synthesis
      • How to pronounce "lead"?
      • INsult / inSULT
      • OBject / obJECT
      • OVERflow / overFLOW
      • DIScount / disCOUNT
      • CONtent / conTENT
  • Parsing
      • Need to know if a word is an N or V before you can parse
  • Information extraction
      • Finding names, relations, etc.
  • Machine translation

Page 14: POS Tagging: Introduction

14/39

Open Class Words

Nouns
  Proper nouns (Boulder, Granby, Eli Manning); English capitalizes these
  Common nouns (the rest)
  Count nouns and mass nouns
    Count: have plurals, get counted: goat/goats, one goat, two goats
    Mass: don't get counted (snow, salt, communism) (*two snows)

Adverbs: tend to modify things
  Unfortunately, John walked home extremely slowly yesterday
  Directional/locative adverbs (here, home, downhill)
  Degree adverbs (extremely, very, somewhat)
  Manner adverbs (slowly, slinkily, delicately)

Verbs
  In English, have morphological affixes (eat/eats/eaten)

15/39

Closed Class Words

Examples:
  prepositions: on, under, over, …
  particles: up, down, on, off, …
  determiners: a, an, the, …
  pronouns: she, who, I, …
  conjunctions: and, but, or, …
  auxiliary verbs: can, may, should, …
  numerals: one, two, three, third, …

16/39

Prepositions from CELEX

17/39

English Particles

18/39

Conjunctions

19/39

POS Tagging: Choosing a Tagset

There are so many parts of speech and potential distinctions we can draw.
To do POS tagging, we need to choose a standard set of tags to work with.
Could pick very coarse tagsets: N, V, Adj, Adv
More commonly used set is finer grained: the "Penn TreeBank tagset", 45 tags (PRP$, WRB, WP$, VBG, …)
Even more fine-grained tagsets exist
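
To make the coarse-versus-fine trade-off concrete, here is a minimal sketch that collapses Penn Treebank tags into the four coarse classes mentioned above. The table and the function name `coarse_tag` are illustrative assumptions, not part of the original slides; the sketch relies only on the Penn convention that noun tags start with NN, verb tags with VB, adjective tags with JJ, and adverb tags with RB.

```python
# Hypothetical helper: collapse fine-grained Penn Treebank tags into the
# coarse classes N, V, Adj, Adv. Anything else (DT, IN, PRP$, WRB, ...)
# is mapped to "Other" in this sketch.
COARSE_PREFIXES = {
    "NN": "N",    # NN, NNS, NNP, NNPS
    "VB": "V",    # VB, VBD, VBG, VBN, VBP, VBZ
    "JJ": "Adj",  # JJ, JJR, JJS
    "RB": "Adv",  # RB, RBR, RBS
}


def coarse_tag(penn_tag: str) -> str:
    """Return the coarse class for a Penn tag, or "Other" if none applies."""
    for prefix, coarse in COARSE_PREFIXES.items():
        if penn_tag.startswith(prefix):
            return coarse
    return "Other"


for tag in ["PRP$", "WRB", "WP$", "VBG", "NNS", "JJR"]:
    print(tag, "->", coarse_tag(tag))
```

Many fine-grained tags collapse into one coarse tag, so the coarse tagset is easier to predict but carries less information (for example, it no longer separates VBD from VBN).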

20/39

Using the Penn Tagset

The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS

Prepositions and subordinating conjunctions are marked IN ("although/IN I/PRP …")
Except the preposition/complementizer "to", which is just marked "TO"

21/39

POS Tagging

Words often have more than one POS: back
  The back door = JJ
  On my back = NN
  Win the voters back = RB
  Promised to back the bill = VB

The POS tagging problem is to determine the POS tag for a particular instance of a word.

These examples from Dekang Lin

22/39

How Hard is POS Tagging? Measuring Ambiguity

23/39

Current Performance

How many tags are correct?
  About 97% currently
  But baseline is already 90%
  Baseline algorithm:
    Tag every word with its most frequent tag
    Tag unknown words as nouns

How well do people do?
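
The baseline algorithm on this slide can be sketched in a few lines of Python. The function names, the lowercasing of words, and the toy training sentences are assumptions made for illustration; real accuracy figures such as the 90% quoted above come from training on a large tagged corpus, not from data like this.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Return a dict mapping each (lowercased) word to its most frequent tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_baseline(words, most_frequent, unknown_tag="NN"):
    """Tag known words with their most frequent tag, unknown words as nouns."""
    return [(w, most_frequent.get(w.lower(), unknown_tag)) for w in words]


# Toy training data, invented for illustration only.
TRAIN = [
    [("the", "DT"), ("back", "JJ"), ("door", "NN")],
    [("on", "IN"), ("my", "PRP$"), ("back", "NN")],
    [("the", "DT"), ("back", "NN"), ("hurts", "VBZ")],
]

model = train_baseline(TRAIN)
result = tag_baseline(["the", "back", "gate"], model)
print(result)
```

Here "back" was seen twice as NN and once as JJ, so it is always tagged NN; "gate" was never seen, so it falls back to the noun tag. The baseline ignores context entirely, which is exactly what the remaining few percent of accuracy depend on.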

24/39

Quick Test: Agreement?

the students went to class
plays well with others
fruit flies like a banana

DT: the, this, that
NN: noun
VB: verb
P: preposition
ADV: adverb

25/39

Quick Test

the students went to class
  DT NN VB P NN

plays well with others
  VB ADV P NN
  NN NN P DT

fruit flies like a banana
  NN NN VB DT NN
  NN VB P DT NN
  NN NN P DT NN
  NN VB VB DT NN
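
One simple way to quantify agreement between two candidate taggings is the fraction of positions where the tags match. The sketch below is an assumption about how one might measure it (plain observed agreement, with no chance correction); the function name `agreement` is hypothetical and the two sequences are taken from the candidate taggings of "fruit flies like a banana".

```python
def agreement(tags_a, tags_b):
    """Fraction of token positions on which two tag sequences agree."""
    if not tags_a or len(tags_a) != len(tags_b):
        raise ValueError("tag sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)


# Two candidate taggings of "fruit flies like a banana".
reading_1 = "NN NN VB DT NN".split()
reading_2 = "NN VB P DT NN".split()

print(agreement(reading_1, reading_2))
```

The two readings differ on "flies" and "like", so they agree on three of the five tokens.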

26/39

How to do it? History

Timeline (1960, 1970, 1980, 1990, 2000):

Brown Corpus created (EN-US), 1 million words
Brown Corpus tagged
HMM tagging (CLAWS): 93%-95%
Greene and Rubin: rule based, 70%
LOB Corpus created (EN-UK), 1 million words
DeRose/Church: efficient HMM, sparse data, 95%+
British National Corpus (tagged by CLAWS)
POS tagging separated from other NLP
Transformation-based tagging (Eric Brill): rule based, 95%+
Tree-based statistics (Helmut Schmid): rule based, 96%+
Neural network: 96%+
Trigram tagger (Kempe): 96%+
Combined methods: 98%+
Penn Treebank Corpus (WSJ, 4.5M)
LOB Corpus tagged

27/39

Two Methods for POS Tagging

1. Rule-based tagging (ENGTWOL)
2. Stochastic
   1. Probabilistic sequence models
      HMM (Hidden Markov Model) tagging
      MEMMs (Maximum Entropy Markov Models)

28/39

Rule-Based Tagging

Start with a dictionary
Assign all possible tags to words from the dictionary
Write rules by hand to selectively remove tags
Leaving the correct tag for each word

29/39

Rule-based taggers

Early POS taggers were all hand-coded
Most of these (Harris, 1962; Greene and Rubin, 1971) and the best of the recent ones, ENGTWOL (Voutilainen, 1995), are based on a two-stage architecture
  Stage 1: look up word in lexicon to give list of potential POSs
  Stage 2: apply rules which certify or disallow tag sequences
Rules were originally handwritten; more recently, Machine Learning methods can be used

30/39

Start With a Dictionary

• she: PRP
• promised: VBN, VBD
• to: TO
• back: VB, JJ, RB, NN
• the: DT
• bill: NN, VB
• Etc… for the ~100,000 words of English with more than 1 tag

31/39

Assign Every Possible Tag

                     NN
                     RB
     VBN             JJ         VB
PRP  VBD         TO  VB    DT   NN
She  promised    to  back  the  bill

32/39

Write Rules to Eliminate Tags

Eliminate VBN if VBD is an option when VBN|VBD follows "<start> PRP"

                     NN
                     RB
                     JJ         VB
PRP  VBD         TO  VB    DT   NN
She  promised    to  back  the  bill

(VBN has been eliminated for "promised")
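
The three steps above (dictionary, assign every possible tag, eliminate by rule) can be sketched as follows. The lexicon entries and the rule come from the slides; the function names and the list-of-lists representation are assumptions made for this illustration, and only the single elimination rule shown on the slide is implemented.

```python
# Lexicon taken from the "Start With a Dictionary" slide.
LEXICON = {
    "she": ["PRP"],
    "promised": ["VBN", "VBD"],
    "to": ["TO"],
    "back": ["VB", "JJ", "RB", "NN"],
    "the": ["DT"],
    "bill": ["NN", "VB"],
}


def assign_all_tags(words):
    """Stage 1: give every word all the tags the lexicon allows."""
    return [list(LEXICON[w.lower()]) for w in words]


def eliminate_vbn_after_initial_prp(candidates):
    """Stage 2 rule: drop VBN if VBD is an option and the word
    follows a sentence-initial PRP ("<start> PRP")."""
    if len(candidates) >= 2 and candidates[0] == ["PRP"]:
        second = candidates[1]
        if "VBN" in second and "VBD" in second:
            candidates[1] = [t for t in second if t != "VBN"]
    return candidates


words = "She promised to back the bill".split()
candidates = eliminate_vbn_after_initial_prp(assign_all_tags(words))
for word, tags in zip(words, candidates):
    print(word, tags)
```

After this one rule, "promised" is resolved to VBD, while "back" and "bill" remain ambiguous; a full rule-based tagger needs further rules to remove those remaining candidates.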

33/39

Stage 1 of ENGTWOL Tagging

First Stage: Run words through FST morphological analyzer to get all parts of speech.
Example: Pavlov had shown that salivation …

Pavlov      PAVLOV N NOM SG PROPER
had         HAVE V PAST VFIN SVO
            HAVE PCP2 SVO
shown       SHOW PCP2 SVOO SVO SV
that        ADV
            PRON DEM SG
            DET CENTRAL DEM SG
            CS
salivation  N NOM SG

34/39

Stage 2 of ENGTWOL Tagging

Second Stage: Apply NEGATIVE constraints.
Example: Adverbial "that" rule
  Eliminates all readings of "that" except the one in "It isn't that odd"

Given input: "that"
If
  (+1 A/ADV/QUANT)   if next word is adj/adv/quantifier
  (+2 SENT-LIM)      following which is E-O-S
  (NOT -1 SVOC/A)    and the previous word is not a verb like "consider", which allows adjective complements, as in "I consider that odd"
Then eliminate non-ADV tags
Else eliminate ADV
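
The adverbial "that" constraint can be expressed as a small function. This is a sketch under stated assumptions, not ENGTWOL's actual rule formalism: each token is a hypothetical dict holding a set of reading labels, the labels A, ADV, QUANT, SENT-LIM and SVOC/A follow the rule as written on the slide, and running off the end of the token list is treated as a sentence limit.

```python
ADJ_ADV_QUANT = {"A", "ADV", "QUANT"}


def adverbial_that_rule(tokens, i):
    """Apply the adverbial-'that' constraint to tokens[i] (the word 'that').

    Each token is a dict with 'word' and 'readings' (a set of labels).
    Returns the surviving readings of tokens[i].
    """
    nxt = tokens[i + 1] if i + 1 < len(tokens) else None
    after = tokens[i + 2] if i + 2 < len(tokens) else None
    prev = tokens[i - 1] if i > 0 else None

    # (+1 A/ADV/QUANT): next word is an adjective, adverb or quantifier
    next_ok = nxt is not None and bool(nxt["readings"] & ADJ_ADV_QUANT)
    # (+2 SENT-LIM): the word after that is a sentence boundary
    end_ok = after is None or "SENT-LIM" in after["readings"]
    # (NOT -1 SVOC/A): previous word is not a verb like "consider"
    prev_ok = prev is None or "SVOC/A" not in prev["readings"]

    readings = tokens[i]["readings"]
    if next_ok and end_ok and prev_ok:
        tokens[i]["readings"] = readings & {"ADV"}   # eliminate non-ADV tags
    else:
        tokens[i]["readings"] = readings - {"ADV"}   # eliminate ADV
    return tokens[i]["readings"]


def sentence(prev_word, prev_readings):
    """Build '<prev_word> that odd .' with all four readings of 'that'."""
    return [
        {"word": prev_word, "readings": set(prev_readings)},
        {"word": "that", "readings": {"ADV", "PRON", "DET", "CS"}},
        {"word": "odd", "readings": {"A"}},
        {"word": ".", "readings": {"SENT-LIM"}},
    ]


isnt_case = adverbial_that_rule(sentence("isn't", {"V"}), 1)
consider_case = adverbial_that_rule(sentence("consider", {"V", "SVOC/A"}), 1)
print(isnt_case, consider_case)
```

In "It isn't that odd" only the ADV reading survives, whereas after "consider" the ADV reading is removed and the other readings are left for later rules.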

35/39

Inline Mark-up

POS Tagging
  http://nlp.cs.qc.cuny.edu/wsj_pos.zip

Input Format
  Pierre Vinken 61/CD years/NNS old will join the board as a nonexecutive director Nov 29

Output Format
  Pierre/NNP Vinken/NNP 61/CD years/NNS old/JJ will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov/NNP 29/CD
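
A minimal sketch of reading and writing this inline format, assuming each token is written as word/TAG with a slash separator and tokens are separated by whitespace. The function names are hypothetical and the sample line is a shortened version of the output example above.

```python
def parse_tagged(line):
    """Split 'word/TAG' tokens into (word, tag) pairs.

    rpartition splits on the LAST slash, so a word that itself
    contains a slash keeps it in the word part.
    """
    pairs = []
    for token in line.split():
        word, sep, tag = token.rpartition("/")
        if not sep:
            raise ValueError(f"token has no tag: {token!r}")
        pairs.append((word, tag))
    return pairs


def format_tagged(pairs):
    """Inverse of parse_tagged: join (word, tag) pairs back into one line."""
    return " ".join(f"{word}/{tag}" for word, tag in pairs)


sample = "Pierre/NNP Vinken/NNP 61/CD years/NNS old/JJ will/MD join/VB"
pairs = parse_tagged(sample)
print(pairs)
```

Splitting on the last slash is a deliberate choice here: tags never contain a slash, but words occasionally do.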

36/39

POS Tagging Tools

NYU Prof. Ralph Grishman's HMM POS tagger (in Java)
  http://nlp.cs.qc.cuny.edu/jet.zip
  http://nlp.cs.qc.cuny.edu/jet_src.zip
  http://www.cs.nyu.edu/cs/faculty/grishman/jet/license.html

Demo
How it works:
  Learned HMM: data/pos_hmm.txt
  Source code: src/jet/HMM/HMMTagger.java

37/39

POS Tagging Tools

Stanford tagger (Loglinear tagger)
  http://nlp.stanford.edu/software/tagger.shtml
Brill tagger
  http://www.tech.plym.ac.uk/soc/staff/guidbugm/software/RULE_BASED_TAGGER_V.1.14.tar.Z
  tagger LEXICON test BIGRAMS LEXICALRULEFILE CONTEXTUALRULEFILE
YamCha (SVM)
  http://chasen.org/~taku/software/yamcha/
MXPOST (Maximum Entropy)
  ftp://ftp.cis.upenn.edu/pub/adwait/jmx/
More complete list at: http://www-nlp.stanford.edu/links/statnlp.html#Taggers

38/39

NLP Toolkits

Uniform CL Annotation Platform
  UIMA (IBM NLP platform): http://incubator.apache.org/uima/svn.html
  Mallet (UMASS): http://mallet.cs.umass.edu/index.php/Main_Page
  MinorThird (CMU): http://minorthird.sourceforge.net
  NLTK: http://nltk.sourceforge.net
    Natural language toolkit, with data sets
    Demo

Information Extraction
  Jet (NYU IE toolkit): http://www.cs.nyu.edu/cs/faculty/grishman/jet/license.html
  Gate: http://gate.ac.uk/download/index.html
    University of Sheffield IE toolkit

Information Retrieval
  INDRI: http://www.lemurproject.org/indri/
    Information Retrieval toolkit

Machine Translation
  Compara: http://adamastor.linguateca.pt/COMPARA/Welcome.html
  ISI decoder: http://www.isi.edu/licensed-sw/rewrite-decoder/
  MOSES: http://www.statmt.org/moses/

39/39

Looking Ahead: Next Class

Machine Learning for POS Tagging
  Hidden Markov Model

Page 18: POS Tagging: Introduction

1839

Conjunctions

1939

POS TaggingChoosing a Tagset

There are so many parts of speech potential distinctions we can draw

To do POS tagging we need to choose a standard set of tags to work with

Could pick very coarse tagsets N V Adj Adv

More commonly used set is finer grained the ldquoPenn TreeBank tagsetrdquo 45 tags PRP$ WRB WP$ VBG

Even more fine-grained tagsets exist

2039

Using the Penn Tagset

TheDT grandJJ juryNN commmentedVBD onIN aDT numberNN ofIN otherJJ topicsNNS

Prepositions and subordinating conjunctions marked IN (ldquoalthoughIN IPRPrdquo)

Except the prepositioncomplementizer ldquotordquo is just marked ldquoTOrdquo

2139

POS Tagging

Words often have more than one POS back The back door = JJ On my back = NN Win the voters back = RB Promised to back the bill = VB

The POS tagging problem is to determine the POS tag for a particular instance of a word

These examples from Dekang Lin

2239

How Hard is POS Tagging Measuring Ambiguity

2339

Current Performance How many tags are correct

About 97 currently But baseline is already 90 Baseline algorithm

Tag every word with its most frequent tag Tag unknown words as nouns

How well do people do

2439

Quick Test Agreement the students went to class plays well with others fruit flies like a banana

DT the this thatNN nounVB verbP prepostionADV adverb

2539

Quick Test the students went to class DT NN VB P NN plays well with others VB ADV P NN NN NN P DT fruit flies like a banana NN NN VB DT NN NN VB P DT NN NN NN P DT NN NN VB VB DT NN

2639

How to do it History

1960 1970 1980 1990 2000

Brown Corpus Created (EN-US)1 Million Words

Brown Corpus Tagged

HMM Tagging (CLAWS)93-95

Greene and RubinRule Based - 70

LOB Corpus Created (EN-UK)1 Million Words

DeRoseChurchEfficient HMMSparse Data

95+

British National Corpus

(tagged by CLAWS)

POS Tagging separated from

other NLP

Transformation Based Tagging

(Eric Brill)Rule Based ndash 95+

Tree-Based Statistics (Helmut Shmid)

Rule Based ndash 96+

Neural Network 96+

Trigram Tagger(Kempe)

96+

Combined Methods98+

Penn Treebank Corpus

(WSJ 45M)

LOB Corpus Tagged

2739

Two Methods for POS Tagging

1 Rule-based tagging (ENGTWOL)

2 Stochastic1 Probabilistic sequence models

HMM (Hidden Markov Model) tagging MEMMs (Maximum Entropy Markov Models)

2839

Rule-Based Tagging

Start with a dictionary Assign all possible tags to words from the

dictionary Write rules by hand to selectively remove

tags Leaving the correct tag for each word

29/39

Rule-based taggers

  • Early POS taggers were all hand-coded
  • Most of these (Harris, 1962; Greene and Rubin, 1971) and the best of the recent ones, ENGTWOL (Voutilainen, 1995), are based on a two-stage architecture
      • Stage 1: look up each word in a lexicon to get a list of potential POSs
      • Stage 2: apply rules which certify or disallow tag sequences
  • Rules were originally handwritten; more recently, machine learning methods can be used

30/39

Start With a Dictionary

  • she: PRP
  • promised: VBN, VBD
  • to: TO
  • back: VB, JJ, RB, NN
  • the: DT
  • bill: NN, VB
  • Etc… for the ~100,000 words of English with more than 1 tag

31/39

Assign Every Possible Tag

She   promised   to   back   the   bill
PRP   VBD        TO   VB     DT    NN
      VBN             JJ           VB
                      RB
                      NN
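The dictionary lookup step can be sketched as follows. The lexicon entries are the ones listed on the dictionary slide; the function name `candidate_tags` and the choice to give unknown words an empty candidate list are assumptions made for this illustration.

```python
import math

# Entries taken from the "Start With a Dictionary" slide.
LEXICON = {
    "she": ["PRP"],
    "promised": ["VBN", "VBD"],
    "to": ["TO"],
    "back": ["VB", "JJ", "RB", "NN"],
    "the": ["DT"],
    "bill": ["NN", "VB"],
}


def candidate_tags(words, lexicon):
    """Return one list of candidate tags per word (empty if the word is unknown)."""
    return [list(lexicon.get(w.lower(), [])) for w in words]


sentence = "She promised to back the bill".split()
lattice = candidate_tags(sentence, LEXICON)
for word, tags in zip(sentence, lattice):
    print(f"{word:10s} {' '.join(tags)}")

# Number of distinct tag sequences the rules still have to choose between.
n_sequences = math.prod(len(tags) for tags in lattice)
print(n_sequences)
```

Even this six-word sentence admits 1 × 2 × 1 × 4 × 1 × 2 = 16 tag sequences before any rule is applied, which is why the elimination step matters.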

32/39

Write Rules to Eliminate Tags

Eliminate VBN if VBD is an option when VBN|VBD follows "<start> PRP"

She   promised   to   back   the   bill
PRP   VBD        TO   VB     DT    NN
                      JJ           VB
                      RB
                      NN

Eliminated: VBN (for "promised")
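A hand-written elimination rule of this kind might be coded as below. This is a sketch under stated assumptions: candidate tags are held as one set per word, "follows <start> PRP" is read as "is the second word of a sentence whose first word is unambiguously PRP", and the function name is hypothetical.

```python
def eliminate_vbn_after_start_prp(lattice):
    """Apply the slide's rule to a list of tag sets (one set per word).

    Removes VBN from the second word when VBD is also an option and the
    first word of the sentence is unambiguously PRP. Returns a new lattice.
    """
    result = [set(tags) for tags in lattice]
    if (len(result) >= 2
            and result[0] == {"PRP"}
            and {"VBN", "VBD"} <= result[1]):
        result[1].discard("VBN")
    return result


# Candidate tags for "She promised to back the bill".
before = [{"PRP"}, {"VBN", "VBD"}, {"TO"},
          {"VB", "JJ", "RB", "NN"}, {"DT"}, {"NN", "VB"}]
after = eliminate_vbn_after_start_prp(before)
print(sorted(after[1]))
```

The rule only removes a tag when another candidate remains, so a word is never left with no reading. Further rules would be needed to disambiguate "back" and "bill".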

33/39

Stage 1 of ENGTWOL Tagging

  • First stage: run words through an FST morphological analyzer to get all parts of speech
  • Example: Pavlov had shown that salivation …

Pavlov       PAVLOV N NOM SG PROPER
had          HAVE V PAST VFIN SVO
             HAVE PCP2 SVO
shown        SHOW PCP2 SVOO SVO SV
that         ADV
             PRON DEM SG
             DET CENTRAL DEM SG
             CS
salivation   N NOM SG

34/39

Stage 2 of ENGTWOL Tagging

  • Second stage: apply NEGATIVE constraints
  • Example: adverbial "that" rule
      • Eliminates all readings of "that" except the one in "It isn't that odd"

Given input: "that"
If
  (+1 A/ADV/QUANT)   if next word is adj/adv/quantifier
  (+2 SENT-LIM)      following which is E-O-S
  (NOT -1 SVOC/A)    and the previous word is not a verb like "consider", which allows adjective complements in "I consider that odd"
Then eliminate non-ADV tags
Else eliminate ADV
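The adverbial "that" constraint can be sketched in Python as below. This is not ENGTWOL's own rule formalism. It assumes each token carries a set of coarse reading labels (A, ADV, QUANT, SVOC/A, and so on), that a context test passes if any reading of the neighbouring word matches, and that running off the end of the token list counts as a sentence limit. The function name and data layout are hypothetical.

```python
def adverbial_that(tokens, i):
    """Return the readings of tokens[i] ("that") left after the constraint.

    tokens: list of dicts {"word": str, "readings": set of labels}.
    """
    readings = set(tokens[i]["readings"])

    def has(j, labels):
        # True if position j exists and has at least one of the given labels.
        return 0 <= j < len(tokens) and bool(tokens[j]["readings"] & labels)

    next_ok = has(i + 1, {"A", "ADV", "QUANT"})              # (+1 A/ADV/QUANT)
    limit_ok = i + 2 >= len(tokens) or has(i + 2, {"SENT-LIM"})  # (+2 SENT-LIM)
    prev_ok = not has(i - 1, {"SVOC/A"})                     # (NOT -1 SVOC/A)

    if next_ok and limit_ok and prev_ok:
        return readings & {"ADV"}   # eliminate non-ADV tags
    return readings - {"ADV"}       # eliminate ADV


THAT = {"ADV", "PRON", "DET", "CS"}
isnt_that_odd = [
    {"word": "It", "readings": {"PRON"}},
    {"word": "isn't", "readings": {"V"}},
    {"word": "that", "readings": set(THAT)},
    {"word": "odd", "readings": {"A"}},
]
consider_that_odd = [
    {"word": "I", "readings": {"PRON"}},
    {"word": "consider", "readings": {"V", "SVOC/A"}},
    {"word": "that", "readings": set(THAT)},
    {"word": "odd", "readings": {"A"}},
]
print(sorted(adverbial_that(isnt_that_odd, 2)))
print(sorted(adverbial_that(consider_that_odd, 2)))
```

In the first sentence all three conditions hold, so only the ADV reading survives. In the second, "consider" blocks the rule and the ADV reading is the one removed.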

35/39

Inline Mark-up

  • POS Tagging: http://nlp.cs.qc.cuny.edu/wsj_pos.zip
  • Input format:
      Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
  • Output format:
      Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.
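Reading and writing this inline format takes only a few lines. The sketch below assumes tokens are whitespace-separated and that each token has the form word/TAG with "/" as the separator (an assumption about the file format, since the transcript lost its punctuation). The function names are hypothetical.

```python
def parse_tagged(line, sep="/"):
    """Split a line of word/TAG tokens into (word, tag) pairs.

    The split is on the last separator, so words that themselves contain
    the separator (for example "1/2/CD") keep their full spelling.
    """
    pairs = []
    for token in line.split():
        word, found, tag = token.rpartition(sep)
        if not found:
            raise ValueError(f"token has no tag: {token!r}")
        pairs.append((word, tag))
    return pairs


def format_tagged(pairs, sep="/"):
    """Inverse of parse_tagged: join (word, tag) pairs into one line."""
    return " ".join(f"{word}{sep}{tag}" for word, tag in pairs)


line = "Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ"
pairs = parse_tagged(line)
print(pairs)
print(format_tagged(pairs) == line)
```

Splitting on the last separator rather than the first is a deliberate choice, because a tag never contains "/" while a word occasionally does.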

36/39

POS Tagging Tools

  • NYU Prof. Ralph Grishman's HMM POS tagger (in Java)
      • http://nlp.cs.qc.cuny.edu/jet.zip
      • http://nlp.cs.qc.cuny.edu/jet_src.zip
      • http://www.cs.nyu.edu/cs/faculty/grishman/jet/license.html
  • Demo
  • How it works:
      • Learned HMM: data/pos_hmm.txt
      • Source code: src/jet/HMM/HMMTagger.java

37/39

POS Tagging Tools

  • Stanford tagger (log-linear tagger): http://nlp.stanford.edu/software/tagger.shtml
  • Brill tagger: http://www.tech.plym.ac.uk/soc/staff/guidbugm/software/RULE_BASED_TAGGER_V.1.14.tar.Z
      • tagger LEXICON test BIGRAMS LEXICALRULEFILE CONTEXTUALRULEFILE
  • YamCha (SVM): http://chasen.org/~taku/software/yamcha/
  • MXPOST (Maximum Entropy): ftp://ftp.cis.upenn.edu/pub/adwait/jmx/
  • More complete list at: http://www-nlp.stanford.edu/links/statnlp.html#Taggers

38/39

NLP Toolkits

  • Uniform CL Annotation Platform
      • UIMA (IBM NLP platform): http://incubator.apache.org/uima/svn.html
      • Mallet (UMASS): http://mallet.cs.umass.edu/index.php/Main_Page
      • MinorThird (CMU): http://minorthird.sourceforge.net
      • NLTK: http://nltk.sourceforge.net (natural language toolkit with data sets; demo)
  • Information Extraction
      • Jet (NYU IE toolkit): http://www.cs.nyu.edu/cs/faculty/grishman/jet/license.html
      • Gate (University of Sheffield IE toolkit): http://gate.ac.uk/download/index.html
  • Information Retrieval
      • INDRI (information retrieval toolkit): http://www.lemurproject.org/indri/
  • Machine Translation
      • Compara: http://adamastor.linguateca.pt/COMPARA/Welcome.html
      • ISI decoder: http://www.isi.edu/licensed-sw/rewrite-decoder/
      • MOSES: http://www.statmt.org/moses/

39/39

Looking Ahead

  • Next class: Machine Learning for POS Tagging
      • Hidden Markov Model

  • POS Tagging Introduction
  • Some Administrative Stuff
  • Outline
  • What is Part-of-Speech (POS)
  • Parts of Speech
  • 7 Traditional POS Categories
  • POS Tagging
  • Penn TreeBank POS Tag Set
  • Penn Treebank Tagset
  • Why POS tagging is useful
  • Equivalent Problem in Bioinformatics
  • Why is POS Tagging Useful
  • Open and Closed Classes
  • Open Class Words
  • Closed Class Words
  • Prepositions from CELEX
  • English Particles
  • Conjunctions
  • POS Tagging Choosing a Tagset
  • Using the Penn Tagset
  • Slide 21
  • How Hard is POS Tagging Measuring Ambiguity
  • Current Performance
  • Quick Test Agreement
  • Quick Test
  • How to do it History
  • Two Methods for POS Tagging
  • Rule-Based Tagging
  • Rule-based taggers
  • Start With a Dictionary
  • Assign Every Possible Tag
  • Write Rules to Eliminate Tags
  • Stage 1 of ENGTWOL Tagging
  • Stage 2 of ENGTWOL Tagging
  • Inline Mark-up
  • POS Tagging Tools
  • Slide 37
  • NLP Toolkits
  • Looking Ahead Next Class
Page 19: POS Tagging: Introduction

19/39

POS Tagging: Choosing a Tagset

  • There are so many parts of speech, potential distinctions we can draw
  • To do POS tagging, we need to choose a standard set of tags to work with
  • Could pick very coarse tagsets: N, V, Adj, Adv
  • More commonly used set is finer grained: the "Penn TreeBank tagset", 45 tags (PRP$, WRB, WP$, VBG, …)
  • Even more fine-grained tagsets exist
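To make the coarse versus fine-grained contrast concrete, the sketch below collapses Penn Treebank tags into the coarse classes N, V, Adj, Adv by tag prefix. The prefix mapping and the catch-all "Other" class are illustrative assumptions, not a standard conversion table.

```python
# Illustrative mapping from Penn Treebank tag prefixes to coarse classes.
COARSE_PREFIXES = [("NN", "N"), ("VB", "V"), ("JJ", "Adj"), ("RB", "Adv")]


def coarse_tag(penn_tag):
    """Collapse a Penn Treebank tag into N, V, Adj, Adv, or Other."""
    for prefix, coarse in COARSE_PREFIXES:
        if penn_tag.startswith(prefix):
            return coarse
    return "Other"


tagged = [("The", "DT"), ("grand", "JJ"), ("jury", "NN"), ("commented", "VBD")]
print([(word, coarse_tag(tag)) for word, tag in tagged])
```

Under this mapping NN, NNS, NNP and NNPS all become N, so the coarse tagset can no longer distinguish singular from plural or common from proper nouns. That lost detail is what the finer-grained tagset preserves.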

20/39

Using the Penn Tagset

  • The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
  • Prepositions and subordinating conjunctions are marked IN ("although/IN I/PRP …")
  • Except the preposition/complementizer "to", which is just marked "TO"

21/39

POS Tagging

  • Words often have more than one POS: back
      • The back door = JJ
      • On my back = NN
      • Win the voters back = RB
      • Promised to back the bill = VB
  • The POS tagging problem is to determine the POS tag for a particular instance of a word

These examples are from Dekang Lin

Page 23: POS Tagging: Introduction

2339

Current Performance How many tags are correct

About 97 currently But baseline is already 90 Baseline algorithm

Tag every word with its most frequent tag Tag unknown words as nouns

How well do people do

2439

Quick Test Agreement the students went to class plays well with others fruit flies like a banana

DT the this thatNN nounVB verbP prepostionADV adverb

2539

Quick Test the students went to class DT NN VB P NN plays well with others VB ADV P NN NN NN P DT fruit flies like a banana NN NN VB DT NN NN VB P DT NN NN NN P DT NN NN VB VB DT NN

2639

How to do it History

1960 1970 1980 1990 2000

Brown Corpus Created (EN-US)1 Million Words

Brown Corpus Tagged

HMM Tagging (CLAWS)93-95

Greene and RubinRule Based - 70

LOB Corpus Created (EN-UK)1 Million Words

DeRoseChurchEfficient HMMSparse Data

95+

British National Corpus

(tagged by CLAWS)

POS Tagging separated from

other NLP

Transformation Based Tagging

(Eric Brill)Rule Based ndash 95+

Tree-Based Statistics (Helmut Shmid)

Rule Based ndash 96+

Neural Network 96+

Trigram Tagger(Kempe)

96+

Combined Methods98+

Penn Treebank Corpus

(WSJ 45M)

LOB Corpus Tagged

2739

Two Methods for POS Tagging

1 Rule-based tagging (ENGTWOL)

2 Stochastic1 Probabilistic sequence models

HMM (Hidden Markov Model) tagging MEMMs (Maximum Entropy Markov Models)

2839

Rule-Based Tagging

Start with a dictionary Assign all possible tags to words from the

dictionary Write rules by hand to selectively remove

tags Leaving the correct tag for each word

29/39

Rule-based taggers

Early POS taggers all hand-coded
Most of these (Harris 1962; Greene and Rubin 1971) and the best of the recent ones, ENGTWOL (Voutilainen 1995), based on a two-stage architecture
  Stage 1: look up word in lexicon to give list of potential POSs
  Stage 2: apply rules which certify or disallow tag sequences
Rules originally handwritten; more recently machine learning methods can be used

30/39

Start With a Dictionary

• she: PRP
• promised: VBN, VBD
• to: TO
• back: VB, JJ, RB, NN
• the: DT
• bill: NN, VB
• Etc… for the ~100,000 words of English with more than 1 tag

31/39

Assign Every Possible Tag

                     NN
                     RB
     VBN             JJ         VB
PRP  VBD       TO    VB    DT   NN
She  promised  to    back  the  bill
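The dictionary lookup that produces this lattice can be sketched as follows. The `LEXICON` entries come from the "Start With a Dictionary" slide; the function names `assign_all_tags` and `count_tag_sequences` are hypothetical, and in this sketch a word missing from the lexicon simply raises `KeyError`.

```python
# Lexicon from the slide: every word maps to all tags it can take.
LEXICON = {
    "she": ["PRP"],
    "promised": ["VBN", "VBD"],
    "to": ["TO"],
    "back": ["VB", "JJ", "RB", "NN"],
    "the": ["DT"],
    "bill": ["NN", "VB"],
}

def assign_all_tags(words, lexicon):
    """Stage 1: pair each word with every tag the lexicon allows for it."""
    return [(w, list(lexicon[w.lower()])) for w in words]

def count_tag_sequences(lattice):
    """Number of distinct tag sequences the lattice still permits."""
    total = 1
    for _, tags in lattice:
        total *= len(tags)
    return total

lattice = assign_all_tags("She promised to back the bill".split(), LEXICON)
n_sequences = count_tag_sequences(lattice)  # 1 * 2 * 1 * 4 * 1 * 2
```

Counting the sequences shows why the second stage is needed: even this six-word sentence admits 16 candidate taggings before any rule has been applied.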

32/39

Write Rules to Eliminate Tags

Eliminate VBN if VBD is an option when VBN|VBD follows "<start> PRP"

                     NN
                     RB
     VBN (removed)   JJ         VB
PRP  VBD       TO    VB    DT   NN
She  promised  to    back  the  bill
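A minimal sketch of this elimination rule, assuming the lattice is a list of (word, candidate tags) pairs and reading "follows <start> PRP" as "is the second word of a sentence whose first word can only be PRP". The function name `eliminate_vbn_after_start_prp` is hypothetical.

```python
def eliminate_vbn_after_start_prp(lattice):
    """Drop VBN from the second word when VBD is also possible and the
    sentence starts with an unambiguous PRP. Returns a new lattice."""
    result = [(word, list(tags)) for word, tags in lattice]
    # The rule only inspects the word right after a sentence-initial pronoun.
    if len(result) >= 2 and result[0][1] == ["PRP"]:
        word, tags = result[1]
        if "VBN" in tags and "VBD" in tags:
            result[1] = (word, [t for t in tags if t != "VBN"])
    return result

lattice = [
    ("She", ["PRP"]),
    ("promised", ["VBN", "VBD"]),
    ("to", ["TO"]),
    ("back", ["VB", "JJ", "RB", "NN"]),
    ("the", ["DT"]),
    ("bill", ["NN", "VB"]),
]
pruned = eliminate_vbn_after_start_prp(lattice)
```

Only "promised" is disambiguated here; "back" and "bill" keep all their candidates, so further hand-written rules would be needed to reach a single tag per word.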

33/39

Stage 1 of ENGTWOL Tagging

First Stage: Run words through FST morphological analyzer to get all parts of speech
Example: Pavlov had shown that salivation …

Pavlov      PAVLOV N NOM SG PROPER
had         HAVE V PAST VFIN SVO
            HAVE PCP2 SVO
shown       SHOW PCP2 SVOO SVO SV
that        ADV
            PRON DEM SG
            DET CENTRAL DEM SG
            CS
salivation  N NOM SG

34/39

Stage 2 of ENGTWOL Tagging

Second Stage: Apply NEGATIVE constraints
Example: Adverbial "that" rule
Eliminates all readings of "that" except the one in "It isn't that odd"

Given input: "that"
If
  (+1 A/ADV/QUANT)   if next word is adj/adv/quantifier
  (+2 SENT-LIM)      following which is E-O-S
  (NOT -1 SVOC/A)    and the previous word is not a verb like "consider" which allows adjective complements in "I consider that odd"
Then eliminate non-ADV tags
Else eliminate ADV
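The adverbial "that" constraint can be sketched in Python as below. This is not ENGTWOL's actual rule formalism: the data layout (a list of (word, set of readings) pairs), the `SENT_LIM` marker, and the function name `adverbial_that_rule` are assumptions made for illustration, and each reading is collapsed to a single label.

```python
SENT_LIM = "<SENT-LIM>"              # hypothetical sentence-boundary marker
A_ADV_QUANT = {"A", "ADV", "QUANT"}  # adjective, adverb, quantifier

def adverbial_that_rule(readings, i):
    """Return the readings of 'that' at position i that survive the constraint."""
    word, tags = readings[i]
    if word.lower() != "that":
        raise ValueError("rule applies only to the word 'that'")
    # (+1 A/ADV/QUANT): the next word can be an adjective, adverb or quantifier
    next_ok = i + 1 < len(readings) and bool(readings[i + 1][1] & A_ADV_QUANT)
    # (+2 SENT-LIM): the position after that is a sentence boundary
    limit_ok = i + 2 < len(readings) and readings[i + 2][0] == SENT_LIM
    # (NOT -1 SVOC/A): the previous word is not a verb like "consider"
    prev_svoc_a = i >= 1 and "SVOC/A" in readings[i - 1][1]
    if next_ok and limit_ok and not prev_svoc_a:
        return tags & {"ADV"}   # eliminate non-ADV tags
    return tags - {"ADV"}       # eliminate ADV

THAT = {"ADV", "PRON", "DET", "CS"}
odd = [("It", {"PRON"}), ("isn't", {"V"}), ("that", THAT),
       ("odd", {"A"}), (SENT_LIM, set())]
consider = [("I", {"PRON"}), ("consider", {"V", "SVOC/A"}), ("that", THAT),
            ("odd", {"A"}), (SENT_LIM, set())]

kept_odd = adverbial_that_rule(odd, 2)
kept_consider = adverbial_that_rule(consider, 2)
```

In "It isn't that odd" only the ADV reading survives, while in "I consider that odd" the ADV reading is the one removed, matching the two cases named on the slide.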

35/39

Inline Mark-up

POS Tagging: http://nlp.cs.qc.cuny.edu/wsj_pos.zip

Input Format:
  Pierre Vinken 61/CD years/NNS old will join the board as a nonexecutive director Nov 29

Output Format:
  Pierre/NNP Vinken/NNP 61/CD years/NNS old/JJ will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov/NNP 29/CD
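Reading and writing this inline format is straightforward. The sketch below assumes the usual word/TAG convention with a slash as the separator, and splits on the last slash so that tokens containing a slash of their own still parse; the function names `parse_tagged` and `format_tagged` are hypothetical.

```python
def parse_tagged(line):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs.
    Tokens without a slash are returned with tag None."""
    pairs = []
    for token in line.split():
        word, sep, tag = token.rpartition("/")
        if sep:
            pairs.append((word, tag))
        else:
            pairs.append((token, None))
    return pairs

def format_tagged(pairs):
    """Inverse of parse_tagged: join (word, tag) pairs back into one line."""
    return " ".join(w if t is None else f"{w}/{t}" for w, t in pairs)

line = "Pierre/NNP Vinken/NNP 61/CD years/NNS old/JJ will/MD join/VB"
pairs = parse_tagged(line)
round_trip = format_tagged(pairs)
```

The same two functions cover the partially tagged input shown above, since untagged tokens pass through unchanged.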

36/39

POS Tagging Tools

NYU Prof. Ralph Grishman's HMM POS tagger (in Java)
  http://nlp.cs.qc.cuny.edu/jet.zip
  http://nlp.cs.qc.cuny.edu/jet_src.zip
  http://www.cs.nyu.edu/cs/faculty/grishman/jet/license.html
Demo
How it works:
  Learned HMM: data/pos_hmm.txt
  Source code: src/jet/HMM/HMMTagger.java

37/39

POS Tagging Tools

Stanford tagger (loglinear tagger)
  http://nlp.stanford.edu/software/tagger.shtml
Brill tagger
  http://www.tech.plym.ac.uk/soc/staff/guidbugm/software/RULE_BASED_TAGGER_V.1.14.tar.Z
  tagger LEXICON test BIGRAMS LEXICALRULEFILE CONTEXTUALRULEFILE
YamCha (SVM)
  http://chasen.org/~taku/software/yamcha/
MXPOST (Maximum Entropy)
  ftp://ftp.cis.upenn.edu/pub/adwait/jmx/
More complete list at:
  http://www-nlp.stanford.edu/links/statnlp.html#Taggers

38/39

NLP Toolkits

Uniform CL Annotation Platform
  UIMA (IBM NLP platform): http://incubator.apache.org/uima/svn.html
  Mallet (UMASS): http://mallet.cs.umass.edu/index.php/Main_Page
  MinorThird (CMU): http://minorthird.sourceforge.net/
  NLTK: http://nltk.sourceforge.net/
    Natural language toolkit, with data sets
    Demo
Information Extraction
  Jet (NYU IE toolkit): http://www.cs.nyu.edu/cs/faculty/grishman/jet/license.html
  Gate: http://gate.ac.uk/download/index.html
    University of Sheffield IE toolkit
Information Retrieval
  INDRI: http://www.lemurproject.org/indri/
    Information Retrieval toolkit
Machine Translation
  Compara: http://adamastor.linguateca.pt/COMPARA/Welcome.html
  ISI decoder: http://www.isi.edu/licensed-sw/rewrite-decoder/
  MOSES: http://www.statmt.org/moses/

39/39

Looking Ahead: Next Class

Machine Learning for POS Tagging
  Hidden Markov Model

Page 29: POS Tagging: Introduction

2939

Rule-based taggers Early POS taggers all hand-coded Most of these (Harris 1962 Greene and Rubin

1971) and the best of the recent ones ENGTWOL (Voutilainen 1995) based on a two-stage architecture Stage 1 look up word in lexicon to give list of potential

POSs Stage 2 Apply rules which certify or disallow tag

sequences Rules originally handwritten more recently Machine

Learning methods can be used

3039

Start With a Dictionarybull she PRPbull promised VBNVBDbull to TObull back VB JJ RB NNbull the DTbull bill NN VB

bull Etchellip for the ~100000 words of English with more than 1 tag

3139

Assign Every Possible Tag

NNRB

VBN JJ VBPRP VBD TO VB DT NNShe promised to back the bill

3239

Write Rules to Eliminate Tags

Eliminate VBN if VBD is an option when VBN|VBD follows ldquoltstartgt PRPrdquo

NN RB JJ VB

PRP VBD TO VB DT NNShe promised to back the bill

VBN

3339

Stage 1 of ENGTWOL Tagging First Stage Run words through FST morphological

analyzer to get all parts of speech Example Pavlov had shown that salivation hellip

Pavlov PAVLOV N NOM SG PROPERhad HAVE V PAST VFIN SVO

HAVE PCP2 SVOshown SHOW PCP2 SVOO SVO SVthat ADV

PRON DEM SGDET CENTRAL DEM SGCS

salivation N NOM SG

34/39

Stage 2 of ENGTWOL Tagging
  • Second stage: apply NEGATIVE constraints.
  • Example: adverbial "that" rule
    • Eliminates all readings of "that" except the one in "It isn't that odd".
  • Given input: "that"
  • If
    • (+1 A/ADV/QUANT): the next word is an adjective, adverb, or quantifier,
    • (+2 SENT-LIM): following which is the end of the sentence (E-O-S),
    • (NOT -1 SVOC/A): and the previous word is not a verb like "consider", which allows adjective complements as in "I consider that odd"
  • Then eliminate non-ADV tags
  • Else eliminate ADV
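The adverbial "that" constraint can be sketched as follows. This is a simplification under stated assumptions: readings are reduced to coarse tag sets, the SVOC/A check is approximated by a small word list (ENGTWOL would test a feature on the previous word's reading), and the function name and sentence-limit test are invented for this example.

```python
def apply_adverbial_that(words, tags, i, svoc_a_verbs=("consider",)):
    """Apply the adverbial-"that" constraint to the token at position i.

    `words` is a list of tokens and `tags` a parallel list of tag sets.
    Returns a new list of tag sets; the inputs are not modified.
    """
    # (+1 A/ADV/QUANT): next word can be an adjective, adverb, or quantifier.
    next_tags = tags[i + 1] if i + 1 < len(words) else set()
    next_ok = bool(next_tags & {"A", "ADV", "QUANT"})
    # (+2 SENT-LIM): the position after that is the sentence limit.
    sent_limit = i + 2 >= len(words) or words[i + 2] in {".", "!", "?"}
    # (NOT -1 SVOC/A): previous word is not a verb like "consider".
    prev_ok = i == 0 or words[i - 1].lower() not in svoc_a_verbs

    new_tags = [set(t) for t in tags]
    if next_ok and sent_limit and prev_ok:
        new_tags[i] = tags[i] & {"ADV"}   # eliminate non-ADV tags
    else:
        new_tags[i] = tags[i] - {"ADV"}   # eliminate ADV
    return new_tags


THAT = {"ADV", "PRON", "DET", "CS"}

odd_words = ["It", "is", "n't", "that", "odd"]
odd_tags = [{"PRON"}, {"V"}, {"NEG"}, set(THAT), {"A"}]
odd_result = apply_adverbial_that(odd_words, odd_tags, 3)

consider_words = ["I", "consider", "that", "odd"]
consider_tags = [{"PRON"}, {"V"}, set(THAT), {"A"}]
consider_result = apply_adverbial_that(consider_words, consider_tags, 2)

print(odd_result[3])
print(sorted(consider_result[2]))
```

In the first sentence only the ADV reading of "that" survives; in the second, the preceding "consider" blocks the rule, so the ADV reading is the one removed.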

35/39

Inline Mark-up
  • POS tagging data: http://nlp.cs.qc.cuny.edu/wsj_pos.zip
  • Input format:
    • Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
  • Output format:
    • Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.
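Reading and writing this inline format takes only a few lines. The sketch below assumes the usual Penn Treebank convention of whitespace-separated `word/TAG` tokens; the helper names `parse_tagged` and `to_inline` are made up for this example.

```python
def parse_tagged(line):
    """Split a line of word/TAG tokens into (word, tag) pairs.

    The tag is taken after the LAST slash, so a word that itself
    contains a slash (for example "1/2/CD") keeps its slash.
    """
    pairs = []
    for token in line.split():
        word, sep, tag = token.rpartition("/")
        if not sep or not word or not tag:
            raise ValueError(f"not a word/TAG token: {token!r}")
        pairs.append((word, tag))
    return pairs


def to_inline(pairs):
    """Render (word, tag) pairs back into the inline word/TAG format."""
    return " ".join(f"{word}/{tag}" for word, tag in pairs)


sample = "Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ"
pairs = parse_tagged(sample)
print(pairs)
print(to_inline(pairs))
```

Splitting on the last slash rather than the first is a deliberate choice here, since tags never contain a slash but tokens occasionally do.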

36/39

POS Tagging Tools
  • NYU: Prof. Ralph Grishman's HMM POS tagger (in Java)
    • http://nlp.cs.qc.cuny.edu/jet.zip
    • http://nlp.cs.qc.cuny.edu/jet_src.zip
    • http://www.cs.nyu.edu/cs/faculty/grishman/jet/license.html
  • Demo
  • How it works:
    • Learned HMM: data/pos_hmm.txt
    • Source code: src/jet/HMM/HMMTagger.java

37/39

POS Tagging Tools
  • Stanford tagger (log-linear tagger): http://nlp.stanford.edu/software/tagger.shtml
  • Brill tagger: http://www.tech.plym.ac.uk/soc/staff/guidbugm/software/RULE_BASED_TAGGER_V.1.14.tar.Z
    • Usage: tagger LEXICON test BIGRAMS LEXICALRULEFILE CONTEXTUALRULEFILE
  • YamCha (SVM): http://chasen.org/~taku/software/yamcha/
  • MXPOST (Maximum Entropy): ftp://ftp.cis.upenn.edu/pub/adwait/jmx/
  • More complete list at: http://www-nlp.stanford.edu/links/statnlp.html#Taggers

38/39

NLP Toolkits
  • Uniform CL Annotation Platform
    • UIMA (IBM NLP platform): http://incubator.apache.org/uima/svn.html
    • Mallet (UMASS): http://mallet.cs.umass.edu/index.php/Main_Page
    • MinorThird (CMU): http://minorthird.sourceforge.net/
    • NLTK: http://nltk.sourceforge.net/ (natural language toolkit with data sets)
    • Demo
  • Information Extraction
    • Jet (NYU IE toolkit): http://www.cs.nyu.edu/cs/faculty/grishman/jet/license.html
    • GATE: http://gate.ac.uk/download/index.html (University of Sheffield IE toolkit)
  • Information Retrieval
    • INDRI: http://www.lemurproject.org/indri/ (information retrieval toolkit)
  • Machine Translation
    • Compara: http://adamastor.linguateca.pt/COMPARA/Welcome.html
    • ISI decoder: http://www.isi.edu/licensed-sw/rewrite-decoder/
    • MOSES: http://www.statmt.org/moses/

39/39

Looking Ahead: Next Class
  • Machine Learning for POS Tagging
  • Hidden Markov Model
