Sequence Classification: Chunking & NER
Shallow Processing Techniques for NLP
Ling570, November 23, 2011
Roadmap
Named Entity Recognition
Chunking
HW #9
Named Entity Recognition
Roadmap: Named Entity Recognition
Definition
Motivation
Challenges
Common Approach
Named Entity Recognition
Task: Identify named entities in (typically) unstructured text
Typical entities: person names, locations, organizations, dates, times
Example
Microsoft released Windows Vista in 2007.
<ORG>Microsoft</ORG> released <PRODUCT>Windows Vista</PRODUCT> in <YEAR>2007</YEAR>
Entities are often application/domain specific:
Business intelligence: products, companies, features
Biomedical: genes, proteins, diseases, drugs, …
Example due to F. Xia
Named Entity Types
Common categories
Named Entity Examples
For common categories:
Why NER?
Machine translation:
Person names typically not translated, though possibly transliterated: Waldheim
Numbers: 9/11: date vs. ratio; 911: emergency phone number vs. simple number
Why NER?
Information extraction:
MUC task: joint ventures/mergers
Focus on company names, person names (CEO), valuations
Information retrieval:
Named entities are the focus of retrieval; in some data sets, 60+% of queries target NEs
Text-to-speech: 206-616-5728
Phone numbers read differently from other digit strings, and conventions differ by language
Challenges
Ambiguity:
Washington chose: D.C., state, George, etc.
Most digit strings are ambiguous
cat (95 results): CAT(erpillar) stock ticker, Computerized Axial Tomography, Chloramphenicol Acetyl Transferase, small furry mammal
Context & Ambiguity
Evaluation
Precision
Recall
F-measure
Resources
Online:
Name lists: baby names, who's who, newswire services, census.gov
Gazetteers
SEC listings of companies
Tools: LingPipe, OpenNLP, Stanford NLP toolkit
Approaches to NER
Rule/Regex-based:
Match names/entities in lists
Regex: e.g., \d\d/\d\d/\d\d matches 11/23/11; currency: \$\d+\.\d+ (see the sketch below)
Machine learning via sequence labeling: better for names, organizations
Hybrid
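As a concrete illustration of the regex-based approach, here is a minimal Python sketch that tags dates and currency amounts; the tag names (DATE, MONEY) and the exact patterns are illustrative assumptions, not part of the original lecture.

```python
import re

# Illustrative tag/pattern inventory for this sketch.
PATTERNS = [
    ("DATE",  re.compile(r"\b\d\d/\d\d/\d\d\b")),  # e.g., 11/23/11
    ("MONEY", re.compile(r"\$\d+\.\d\d")),         # e.g., $239.00
]

def tag_entities(text):
    """Return sorted (start, end, label, surface) spans for all matches."""
    spans = []
    for label, pattern in PATTERNS:
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), label, m.group()))
    return sorted(spans)

print(tag_entities("Vista shipped on 01/30/07 for $239.00."))
# [(17, 25, 'DATE', '01/30/07'), (30, 37, 'MONEY', '$239.00')]
```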
NER as Sequence Labeling
NER as Classification Task
Instance: token
Labels:
Position: B(eginning), I(nside), O(utside)
NER types: PER, ORG, LOC, NUM
Label: Type-Position, e.g., PER-B, PER-I, O, …
How many tags? (|NER types| x 2) + 1
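The tag-set arithmetic can be made concrete in a few lines of Python; this is a sketch of the Type-Position scheme exactly as described above.

```python
# Sketch: enumerate the Type-Position tag set described above.
NER_TYPES = ["PER", "ORG", "LOC", "NUM"]

# Each type gets a B and an I variant; a single O tag covers non-entities.
TAGS = [f"{t}-{p}" for t in NER_TYPES for p in ("B", "I")] + ["O"]

print(TAGS)
# ['PER-B', 'PER-I', 'ORG-B', 'ORG-I', 'LOC-B', 'LOC-I', 'NUM-B', 'NUM-I', 'O']
print(len(TAGS))  # (4 x 2) + 1 = 9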
NER as Classification: Features
What information can we use for NER?
Predictive tokens: e.g., MD, Rev, Inc., …
How general are these features? Language? Genre? Domain?
NER as Classification: Shape Features
Shape types:
lower (all lower case): e.g., cumming
capitalized (first letter uppercase): e.g., Washington
all caps (all letters capitalized): e.g., WHO
mixed case (mixed upper and lower case): e.g., eBay
capitalized with period: e.g., H.
ends with digit: e.g., A9
contains hyphen: e.g., H-P
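A minimal sketch of a shape-feature extractor covering these types; the function name, category strings, and rule ordering are illustrative assumptions, not a standard API.

```python
import re

def word_shape(token):
    """Map a token to one of the shape categories listed above."""
    if "-" in token:
        return "contains-hyphen"          # H-P
    if token[-1].isdigit():
        return "ends-with-digit"          # A9
    if re.fullmatch(r"[A-Z]\.", token):
        return "cap-period"               # H.
    if token.isupper():
        return "all-caps"                 # WHO
    if token.islower():
        return "lower"                    # cumming
    if token[0].isupper() and token[1:].islower():
        return "capitalized"              # Washington
    return "mixed-case"                   # eBay

for w in ["cumming", "Washington", "WHO", "eBay", "H.", "A9", "H-P"]:
    print(w, "->", word_shape(w))
```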
Example Instance Representation
Example
Sequence Labeling: Example
Evaluation
System: output of automatic tagging
Gold standard: true tags
Precision: # correct chunks / # system chunks
Recall: # correct chunks / # gold chunks
F-measure: F1 = 2PR / (P + R) balances precision & recall
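A minimal sketch of chunk-level precision, recall, and F1, assuming each side's chunks are represented as (start, end, label) spans and a chunk counts as correct only on an exact match; the span representation is an assumption of this sketch.

```python
def prf(system_chunks, gold_chunks):
    """Chunk-level precision, recall, and F1 over exact span matches."""
    correct = len(set(system_chunks) & set(gold_chunks))
    p = correct / len(system_chunks) if system_chunks else 0.0
    r = correct / len(gold_chunks) if gold_chunks else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = [(0, 1, "ORG"), (2, 4, "PRODUCT"), (5, 6, "YEAR")]
sys_out = [(0, 1, "ORG"), (2, 3, "PRODUCT"), (5, 6, "YEAR")]
print(prf(sys_out, gold))  # (0.666..., 0.666..., 0.666...)
```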
Evaluation
Standard measures: precision, recall, F-measure
Computed on entity types (CoNLL evaluation)
Classifiers vs. evaluation measures:
Classifiers optimize tag accuracy
Most common tag? O: most tokens aren't NEs
Evaluation measures focus on NEs
State of the art on standard tasks: PER, LOC: 0.92; ORG: 0.84
Hybrid Approaches
Practical systems exploit lists, rules, learning, …
Multi-pass:
Early passes: high precision, low recall
Later passes: noisier sequence learning
Hybrid system:
High-precision rules tag unambiguous mentions
Use string matching to capture substring matches
Tag items from domain-specific name lists
Apply sequence labeler
Chunking
Roadmap: Chunking
Definition
Motivation
Challenges
Approach
What is Chunking?
Form of partial (shallow) parsing
Extracts major syntactic units, but not full parse trees
Task: identify and classify flat, non-overlapping segments of a sentence
Basic non-recursive phrases
Correspond to major POS categories
May ignore some categories, e.g., base NP chunking
Create simple bracketing:
[NP The morning flight] [PP from] [NP Denver] [VP has arrived]
[NP The morning flight] from [NP Denver] has arrived
Why Chunking?
Used when a full parse is unnecessary, or infeasible or impossible (when?)
Extraction of subcategorization frames: identify verb arguments
e.g., VP -> NP; VP -> NP NP; VP -> NP to NP
Information extraction: who did what to whom
Summarization: base information, remove modifiers
Information retrieval: restrict indexing to base NPs
Processing Example
Tokenization: The morning flight from Denver has arrived
POS tagging: DT JJ N PREP NNP AUX V
Chunking: NP PP NP VP
Extraction: NP NP VP
etc.
Approaches
Finite-state approaches:
Grammatical rules in FSTs
Cascade to produce more complex structure
Machine learning: similar to POS tagging
Finite-State Rule-Based Chunking
Hand-crafted rules model phrases; typically application-specific
Left-to-right longest match (Abney 1996):
Start at beginning of sentence
Find longest matching rule
Greedy approach, not guaranteed optimal
Finite-State Rule-Based Chunking
Chunk rules: cannot contain recursion
NP -> Det Nominal: okay
Nominal -> Nominal PP: not okay
Examples: NP -> (Det) Noun* Noun; NP -> Proper-Noun; VP -> Verb; VP -> Aux Verb
Consider: Time flies like an arrow
Is this what we want? (See the sketch below.)
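A minimal Python sketch of the left-to-right longest-match strategy over a POS tag sequence. The rule inventory and simplified tag names are assumptions; the starred rule NP -> (Det) Noun* Noun is expanded into fixed-length instances. On the reading where both "Time" and "flies" are tagged Noun, greedy matching groups them into one NP, which illustrates why the result may not be what we want.

```python
# Rules as (label, right-hand side of POS tags); illustrative inventory.
RULES = [
    ("NP", ["Det", "Noun", "Noun"]),
    ("NP", ["Det", "Noun"]),
    ("NP", ["Noun", "Noun"]),
    ("NP", ["Noun"]),
    ("NP", ["Proper-Noun"]),
    ("VP", ["Aux", "Verb"]),
    ("VP", ["Verb"]),
]

def chunk(tags):
    """Greedy left-to-right longest match over a POS tag sequence."""
    i, chunks = 0, []
    while i < len(tags):
        best = None
        for label, rhs in RULES:
            if tags[i:i + len(rhs)] == rhs and (best is None or len(rhs) > len(best[1])):
                best = (label, rhs)
        if best:
            chunks.append((best[0], i, i + len(best[1])))
            i += len(best[1])
        else:
            i += 1  # no rule matches: token stays outside any chunk
    return chunks

# "Time flies like an arrow" on the reading Noun Noun Verb Det Noun:
print(chunk(["Noun", "Noun", "Verb", "Det", "Noun"]))
# [('NP', 0, 2), ('VP', 2, 3), ('NP', 3, 5)] -- "Time flies" as one NP
```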
Cascading FSTs
Richer partial parsing: pass the output of one FST to the next
Approach:
First stage: base phrase chunking
Next stage: larger constituents (e.g., PPs, VPs)
Highest stage: sentences
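A minimal sketch of one cascade pass in the same greedy longest-match style as the previous sketch: the pass rewrites the label sequence produced by the previous stage. The stage-2 rule inventory is an illustrative assumption.

```python
# Stage-2 rules over base-chunk labels; illustrative inventory.
STAGE2_RULES = [
    ("PP", ["Prep", "NP"]),  # preposition + base NP -> PP
    ("S",  ["NP", "VP"]),
]

def run_stage(labels, rules):
    """One cascade pass: greedy longest match over a label sequence."""
    i, out = 0, []
    while i < len(labels):
        best = None
        for label, rhs in rules:
            if labels[i:i + len(rhs)] == rhs and (best is None or len(rhs) > len(best[1])):
                best = (label, rhs)
        if best:
            out.append(best[0])
            i += len(best[1])
        else:
            out.append(labels[i])  # pass unchanged to the next stage
            i += 1
    return out

# Stage-1 output for "The morning flight from Denver has arrived",
# with the bare preposition kept as Prep:
stage1 = ["NP", "Prep", "NP", "VP"]
print(run_stage(stage1, STAGE2_RULES))  # ['NP', 'PP', 'VP']
```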
Example
Chunking by Classification
Model chunking as a task similar to POS tagging
Instance: tokens
Labels: simultaneously encode segmentation & identification
IOB (or BIO) tagging (also BIOE or BIOSE)
Segment: B(eginning), I(nternal), O(utside)
Identity: phrase category: NP, VP, PP, etc.
The morning flight from Denver has arrived
NP-B NP-I NP-I PP-B NP-B VP-B VP-I
NP-B NP-I NP-I O NP-B O O (base NP chunking only)
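A minimal sketch of decoding a Type-Position tag sequence back into chunks, using the example above; the function name and output format are assumptions of this sketch.

```python
def decode_bio(tokens, tags):
    """Recover (label, phrase) chunks from per-token Type-Position tags."""
    chunks, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag == "O":
            if current:
                chunks.append(current)
                current = None
            continue
        label, pos = tag.split("-")  # e.g., "NP-B" -> ("NP", "B")
        if pos == "B" or current is None or current[0] != label:
            if current:
                chunks.append(current)
            current = (label, [tok])  # start a new chunk
        else:
            current[1].append(tok)    # continue the current chunk
    if current:
        chunks.append(current)
    return [(label, " ".join(words)) for label, words in chunks]

tokens = "The morning flight from Denver has arrived".split()
tags = ["NP-B", "NP-I", "NP-I", "PP-B", "NP-B", "VP-B", "VP-I"]
print(decode_bio(tokens, tags))
# [('NP', 'The morning flight'), ('PP', 'from'), ('NP', 'Denver'), ('VP', 'has arrived')]
```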
Features for Chunking
What are good features?
Preceding chunk tags: for the 2 preceding words
Words: 2 preceding, current, 2 following
Parts of speech: 2 preceding, current, 2 following
Vector includes those features + the true label (see the sketch below)
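A minimal sketch of assembling the feature vector for one token position from these windows; the padding symbols and the dictionary representation are illustrative assumptions.

```python
def chunk_features(words, pos_tags, chunk_tags, i):
    """Window features for position i: 2 preceding tags, words and POS
    for 2 preceding / current / 2 following positions."""
    def at(seq, j, pad):  # safe indexed access with boundary padding
        return seq[j] if 0 <= j < len(seq) else pad
    return {
        "t-2": at(chunk_tags, i - 2, "<s>"), "t-1": at(chunk_tags, i - 1, "<s>"),
        "w-2": at(words, i - 2, "<s>"), "w-1": at(words, i - 1, "<s>"),
        "w0": words[i],
        "w+1": at(words, i + 1, "</s>"), "w+2": at(words, i + 2, "</s>"),
        "p-2": at(pos_tags, i - 2, "<s>"), "p-1": at(pos_tags, i - 1, "<s>"),
        "p0": pos_tags[i],
        "p+1": at(pos_tags, i + 1, "</s>"), "p+2": at(pos_tags, i + 2, "</s>"),
    }

words = "The morning flight from Denver has arrived".split()
pos = ["DT", "JJ", "N", "PREP", "NNP", "AUX", "V"]
chunks = ["NP-B", "NP-I", "NP-I", "PP-B", "NP-B", "VP-B", "VP-I"]
print(chunk_features(words, pos, chunks, 3))  # features for "from"
```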
Chunking as Classification: Example
Evaluation
System: output of automatic tagging
Gold standard: true tags, typically extracted from a parsed treebank
Precision: # correct chunks / # system chunks
Recall: # correct chunks / # gold chunks
F-measure: F1 = 2PR / (P + R) balances precision & recall
State-of-the-Art
Base NP chunking: 0.96
Complex phrases: learning: 0.92-0.94 (most learners achieve similar results); rule-based: 0.85-0.92
Limiting factors:
POS tagging accuracy
Inconsistent labeling (parse tree extraction)
Conjunctions:
Late departures and arrivals are common in winter
Late departures and cancellations are common in winter
HW #9
Building a MaxEnt POS Tagger
Q1: Build feature vector representations for POS tagging in SVMlight format
maxent_features.* training_file testing_file rare_wd_threshold rare_feat_threshold outdir
training_file, testing_file: like HW #7: w1/t1 w2/t2 … wn/tn
Filter rare words and infrequent features
Store vectors & intermediate representations in outdir
Feature Representations
Features: Ratnaparkhi, 1996, Table 1 (duplicated in MaxEnt slides)
Character issues: replace "," with "comma" and ":" with "colon"
Mallet and SVMlight formats use these characters as delimiters (see the sketch below)
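A minimal sketch of emitting one instance in SVMlight format (label followed by index:value pairs sorted by index), applying the comma/colon escaping first; the feature names and index mapping here are illustrative assumptions, not the HW #9 specification.

```python
def escape(feat):
    """Replace characters that SVMlight/Mallet treat as delimiters."""
    return feat.replace(",", "comma").replace(":", "colon")

def to_svmlight(label_id, feature_names, feat_index):
    """One binary-feature instance: 'label i:1 j:1 ...' sorted by index."""
    ids = sorted(feat_index[f] for f in feature_names if f in feat_index)
    return str(label_id) + " " + " ".join(f"{i}:1" for i in ids)

feat_index = {"curW=the": 1, "prevW=comma": 2, "prevT=DT": 3}
# "prevW=," becomes "prevW=comma" so it cannot be mistaken for a delimiter:
feats = [escape(f) for f in ["curW=the", "prevW=,"]]
print(to_svmlight(5, feats, feat_index))  # 5 1:1 2:1
```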
Q2: Experiments
Run MaxEnt classification using your training and test files
Compare effects of different thresholds on feature count, accuracy, and runtime
Note: Big files
This assignment will produce even larger sets of results than HW #8. Please gzip your tar files. If the DropBox won't accept the files, you can store them on patas; just let Sanghoun know where to find them.