information extraction shallow processing techniques for nlp ling570 december 5, 2011

Post on 19-Dec-2015

229 Views

Category:

Documents

2 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Information Extraction

Shallow Processing Techniques for NLPLing570

December 5, 2011

RoadmapInformation extraction:

Definition & Motivation

General approach

Relation extraction techniques:Fully supervised

Lightly supervised

Information ExtractionMotivation:

Unstructured data hard to manage, assess, analyze

Information ExtractionMotivation:

Unstructured data hard to manage, assess, analyze

Many tools for manipulating structured data:Databases: efficient search, agglomerationStatistical analysis packagesDecision support systems

Information ExtractionMotivation:

Unstructured data hard to manage, assess, analyze

Many tools for manipulating structured data:Databases: efficient search, agglomerationStatistical analysis packagesDecision support systems

Goal:Given unstructured information in textsConvert to structured data

E.g., populate DBs

IE ExampleCiting high fuel prices, United Airlines said Friday it

has increased fares by $6 per round trip to some cities also served by lower-cost carriers. American Airlines, a unit of AMR Corp. immediately matched the move, spokesman Time Wagner said. United, a unit of UAL Corp., said the increase took effect Thursday an applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Denver to San Francisco. Lead airline: Amount: Effective Date: Follower:

IE ExampleCiting high fuel prices, United Airlines said Friday it

has increased fares by $6 per round trip to some cities also served by lower-cost carriers. American Airlines, a unit of AMR Corp. immediately matched the move, spokesman Time Wagner said. United, a unit of UAL Corp., said the increase took effect Thursday an applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Denver to San Francisco. Lead airline: United Airlines Amount: $6 Effective Date: 2006-10-26 Follower: American Airlines

Other ApplicationsShared tasks:

Message Understanding Conferences (MUC)

Other ApplicationsShared tasks:

Message Understanding Conferences (MUC)Joint ventures/mergers:

Other ApplicationsShared tasks:

Message Understanding Conferences (MUC)Joint ventures/mergers:

CompanyA, CompanyB, CompanyC, amount, type

Terrorist incidents

Other ApplicationsShared tasks:

Message Understanding Conferences (MUC)Joint ventures/mergers:

CompanyA, CompanyB, CompanyC, amount, type

Terrorist incidents Incident type, Actor, Victim, Date, Location

General:

Other ApplicationsShared tasks:

Message Understanding Conferences (MUC)Joint ventures/mergers:

CompanyA, CompanyB, CompanyC, amount, type

Terrorist incidents Incident type, Actor, Victim, Date, Location

General:PricegrabberStock market analysisApartment finder

Information ExtractionComponents

Incorporates range of techniques, tools

Information ExtractionComponents

Incorporates range of techniques, toolsFSTs (pattern extraction)Supervised learning:

ClassificationSequence labeling

Information ExtractionComponents

Incorporates range of techniques, toolsFSTs (pattern extraction)Supervised learning:

ClassificationSequence labeling

Semi-, un-supervised learning: clustering, bootstrapping

Information ExtractionComponents

Incorporates range of techniques, toolsFSTs (pattern extraction)Supervised learning:

ClassificationSequence labeling

Semi-, un-supervised learning: clustering, bootstrapping

Part-of-Speech taggingNamed Entity RecognitionChunkingParsing

Information Extraction Process

Typically pipeline of major components

Information Extraction Process

Typically pipeline of major components

First step: Named Entity Recognition (NER)Often application-specific

Generic type: Person, Organiztion, LocationSpecific: genes to college courses

Information Extraction Process

Typically pipeline of major components

First step: Named Entity Recognition (NER)Often application-specific

Generic type: Person, Organiztion, LocationSpecific: genes to college courses

Identify NE mentions: specific instances

Example: Pre-NERCiting high fuel prices, United Airlines said Friday

it has increased fares by $6 per round trip to some cities also served by lower-cost carriers. American Airlines, a unit of AMR Corp. immediately matched the move, spokesman Time Wagner said. United, a unit of UAL Corp., said the increase took effect Thursday an applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Denver to San Francisco.

Example: post-NERCiting high fuel prices, [ORGUnited Airlines] said

[TimeFriday] it has increased fares by [MONEY$6] per round trip to some cities also served by lower-cost carriers. [ORGAmerican Airlines], a unit of [ORGAMR Corp.] immediately matched the move, spokesman [PERTime Wagner] said. [orgUnited], a unit of [ORGUAL Corp.], said the increase took effect [TIMEThursday] an applies to most routes where it competes against discount carriers, such as [LOCChicago] to [LOCDallas] and [LOCDenver] to [LOCSan Francisco].

Information Extraction:Reference Resolution

Builds on NER

Clusters entity mentions referring to same thingsCreates entity equivalence classes

Information Extraction:Reference Resolution

Builds on NER

Clusters entity mentions referring to same thingsCreates entity equivalence classes

Includes anaphora resolutionE.g. pronoun resolution (it, he, she…; one; this, that)

Information Extraction:Reference Resolution

Builds on NER

Clusters entity mentions referring to same thingsCreates entity equivalence classes

Includes anaphora resolutionE.g. pronoun resolution (it, he, she…; one; this, that)

Also non-anaphoric, general mentions

Example: post-NERCiting high fuel prices, [ORGUnited Airlines] said

[TimeFriday] it has increased fares by [MONEY$6] per round trip to some cities also served by lower-cost carriers. [ORGAmerican Airlines], a unit of [ORGAMR Corp.] immediately matched the move, spokesman [PERTime Wagner] said. [orgUnited], a unit of [ORGUAL Corp.], said the increase took effect [TIMEThursday] an applies to most routes where it competes against discount carriers, such as [LOCChicago] to [LOCDallas] and [LOCDenver] to [LOCSan Francisco].

Example: post-NERCiting high fuel prices, [ORGUnited Airlines]

said [TimeFriday] it has increased fares by [MONEY$6] per round trip to some cities also served by lower-cost carriers. [ORGAmerican Airlines], a unit of [ORGAMR Corp.] immediately matched the move, spokesman [PERTime Wagner] said. [orgUnited], a unit of [ORGUAL Corp.], said the increase took effect [TIMEThursday] an applies to most routes where it competes against discount carriers, such as [LOCChicago] to [LOCDallas] and [LOCDenver] to [LOCSan Francisco].

Relation Detection & Classification

Relation detection & classification:Find and classify relations among entities in text

Relation Detection & Classification

Relation detection & classification:Find and classify relations among entities in text

Mix of application-specific and generic relations

Generic relations:

Relation Detection & Classification

Relation detection & classification:Find and classify relations among entities in text

Mix of application-specific and generic relations

Generic relations:

Family, employment, part-whole, membership, geospatial

Example: post-NERCiting high fuel prices, [ORGUnited Airlines]

said [TimeFriday] it has increased fares by [MONEY$6] per round trip to some cities also served by lower-cost carriers. [ORGAmerican Airlines], a unit of [ORGAMR Corp.] immediately matched the move, spokesman [PERTime Wagner] said. [orgUnited], a unit of [ORGUAL Corp.], said the increase took effect [TIMEThursday] an applies to most routes where it competes against discount carriers, such as [LOCChicago] to [LOCDallas] and [LOCDenver] to [LOCSan Francisco].

Relation Detection & Classification

Relation detection & classification:Find and classify relations among entities in text

Mix of application-specific and generic relationsGeneric relations:

Family, employment, part-whole, membership, geospatial

Generic relations:United is part-of UAL Corp.American Airlines is part-of AMRTim Wagner is an employee of American Airlines

Relation Detection & Classification

Relation detection & classification: Find and classify relations among entities in text

Mix of application-specific and generic relationsGeneric relations:

Family, employment, part-whole, membership, geospatial

Generic relations:United is part-of UAL Corp.American Airlines is part-of AMRTim Wagner is an employee of American Airlines

Domain-specific relations:United serves Chicago; United serves Denver, etc

Example: post-NERCiting high fuel prices, [ORGUnited Airlines]

said [TimeFriday] it has increased fares by [MONEY$6] per round trip to some cities also served by lower-cost carriers. [ORGAmerican Airlines], a unit of [ORGAMR Corp.] immediately matched the move, spokesman [PERTime Wagner] said. [orgUnited], a unit of [ORGUAL Corp.], said the increase took effect [TIMEThursday] an applies to most routes where it competes against discount carriers, such as [LOCChicago] to [LOCDallas] and [LOCDenver] to [LOCSan Francisco].

Event Detection &Classification

Event detection and classification:Find key events

Event Detection &Classification

Event detection and classification:Find key events

Fare increase by UnitedMatching fare increase by AmericanReporting of fare increases

How cued?

Event Detection &Classification

Event detection and classification:Find key events

Fare increase by UnitedMatching fare increase by AmericanReporting of fare increases

How cued? Instances of “said” and ”cite”

Temporal Expressions:Recognition and Analysis

Temporal expression recognitionFinds temporal events in text

Example: post-NERCiting high fuel prices, [ORGUnited Airlines]

said [TimeFriday] it has increased fares by [MONEY$6] per round trip to some cities also served by lower-cost carriers. [ORGAmerican Airlines], a unit of [ORGAMR Corp.] immediately matched the move, spokesman [PERTime Wagner] said. [orgUnited], a unit of [ORGUAL Corp.], said the increase took effect [TIMEThursday] an applies to most routes where it competes against discount carriers, such as [LOCChicago] to [LOCDallas] and [LOCDenver] to [LOCSan Francisco].

Temporal Expressions:Recognition and Analysis

Temporal expression recognitionFinds temporal events in text

E.g. Friday, Thursday in example text

Temporal Expressions:Recognition and Analysis

Temporal expression recognitionFinds temporal events in text

E.g. Friday, Thursday in example textOthers:

Date expressions: Absolute: days of week, holidays Relative: two days from now, next year

Clock times: noon, 3:30pm

Temporal Expressions:Recognition and Analysis

Temporal expression recognition Finds temporal events in text

E.g. Friday, Thursday in example text Others:

Date expressions: Absolute: days of week, holidays Relative: two days from now, next year

Clock times: noon, 3:30pm

Temporal analysis: Map expressions to points/spans on timeline

Anchor: e.g. Friday, Thursday: tied to example dateline

Template FillingTexts describe stereotypical situations in domain

Identify texts evoking template situations

Fill slots in templates based on text

Template FillingTexts describe stereotypical situations in domain

Identify texts evoking template situations

Fill slots in templates based on text

In example: fare-raise is stereo-typical event

Information Extraction Steps

Named Entity Recognition

Reference Resolution

Relation Detection & Classification

Event Detection & Classification

Temporal Expression Recognition & Analysis

Template-filling

Relation Detection & Classification

Common Relations

Relations and Entitiesfrom Example

Supervised Approaches to Relation Analysis

Two stage process:

Supervised Approaches to Relation Analysis

Two stage process:Relation detection:

Detect whether a relation holds between two entities

Supervised Approaches to Relation Analysis

Two stage process:Relation detection:

Detect whether a relation holds between two entities Positive examples:

Supervised Approaches to Relation Analysis

Two stage process:Relation detection:

Detect whether a relation holds between two entities Positive examples:

Drawn from labeled training data Negative examples:

Supervised Approaches to Relation Analysis

Two stage process:Relation detection:

Detect whether a relation holds between two entities Positive examples:

Drawn from labeled training data Negative examples:

Generated from within-sentence entity pairs w/no relation

Relation classification:Label each related pair of entities by type

Multi-class classification task

Feature DesignWhich features can we use to classify relations?

Example: post-NERCiting high fuel prices, [ORGUnited Airlines] said

[TimeFriday] it has increased fares by [MONEY$6] per round trip to some cities also served by lower-cost carriers. [ORGAmerican Airlines], a unit of [ORGAMR Corp.] immediately matched the move, spokesman [PERTime Wagner] said. [orgUnited], a unit of [ORGUAL Corp.], said the increase took effect [TIMEThursday] an applies to most routes where it competes against discount carriers, such as [LOCChicago] to [LOCDallas] and [LOCDenver] to [LOCSan Francisco].

Common Relations

Feature DesignWhich features can we use to classify relations?

Feature DesignWhich features can we use to classify relations?

Named entity features:Named entity types of candidate argumentsConcatenation of two entity typesHead words of argumentsBag-of-words from arguments

Feature DesignWhich features can we use to classify relations?

Named entity features:Named entity types of candidate argumentsConcatenation of two entity typesHead words of argumentsBag-of-words from arguments

Context word features:Bag-of-words/bigrams between entities (+/-

stemmed)Words and stems before/after entitiesDistance in words, # entities between arguments

Feature DesignFeatures for relation classification

Syntactic structure features can signal relations E.g. appositive signals part-of relation

Feature DesignFeatures for relation classification

Syntactic structure features can signal relations E.g. appositive signals part-of relation

Feature DesignSyntactic structure features

Derived from syntactic analysisFrom base-phrase chunking to full constituent parse

Feature DesignSyntactic structure features

Derived from syntactic analysisFrom base-phrase chunking to full constituent parse

Features:Presence of constructionChunk base-phrase pathsBag-of-chunk headsDependency/constituency pathDistance between arguments

How can we represent syntax?

Feature DesignSyntactic structure features

Derived from syntactic analysisFrom base-phrase chunking to full constituent parse

Features:Presence of constructionChunk base-phrase pathsBag-of-chunk headsDependency/constituency pathDistance between arguments

How can we represent syntax?Binary features: Detection of construction patterns

Feature DesignSyntactic structure features

Derived from syntactic analysisFrom base-phrase chunking to full constituent parse

Features:Presence of constructionChunk base-phrase pathsBag-of-chunk headsDependency/constituency pathDistance between arguments

How can we represent syntax?Binary features: Detection of construction patternsStructure features: Encode paths through trees

Some Example Features

Lightly Supervised Approaches

Supervised approaches highly effective, but

Lightly Supervised Approaches

Supervised approaches highly effective, butRequire large manually annotated corpus for

training

Lightly Supervised Approaches

Supervised approaches highly effective, butRequire large manually annotated corpus for

training

Unsupervised approaches interesting, but

Lightly Supervised Approaches

Supervised approaches highly effective, butRequire large manually annotated corpus for

training

Unsupervised approaches interesting, butMay not yield results consistent with desired

categories

Lightly supervised approaches:

Lightly Supervised Approaches

Supervised approaches highly effective, butRequire large manually annotated corpus for

training

Unsupervised approaches interesting, butMay not yield results consistent with desired

categories

Lightly supervised approaches:Build on small seed sets of patterns that capture

relations of interestE.g. manually constructed regular expressions

Lightly Supervised Approaches

Supervised approaches highly effective, butRequire large manually annotated corpus for

training

Unsupervised approaches interesting, butMay not yield results consistent with desired

categories

Lightly supervised approaches:Build on small seed sets of patterns that capture

relations of interestE.g. manually constructed regular expressions

Iteratively generalize and refine patterns to optimize

Bootstrapping RelationsCreate seed patterns and instances

Example: finding airline hub citiesInitial pattern:

Bootstrapping RelationsCreate seed patterns and instances

Example: finding airline hub citiesInitial pattern: / * has a hub at * /

In a large corpus finds examples like: Delta has a hub at LaGuardia. Milwaukee-based Midwest has a hub at KCI. American airlines has a hub at the San Juan airport

Bootstrapping RelationsCreate seed patterns and instances

Example: finding airline hub citiesInitial pattern: / * has a hub at * /

In a large corpus finds examples like: Delta has a hub at LaGuardia. Milwaukee-based Midwest has a hub at KCI. American airlines has a hub at the San Juan airport

Instances: (Delta, LaGuardia), (Midwest,KCI),(American,San Juan)

Bootstrapping RelationsIssues:

Bootstrapping RelationsIssues:

False alarms:Some matches aren’t valid relations

The catheter has a hub at the proximal end. A star topology often has a hub at the center.

Fix?

Bootstrapping RelationsIssues:

False alarms:Some matches aren’t valid relations

The catheter has a hub at the proximal end. A star topology often has a hub at the center.

Fix? /[ORG] has hub at [LOC]/Misses:

Some instances are narrowly missed by pattern No frills rival easyJet, which has established a hub at

Liverpool. Ryanair also has a continental hub at Charleroi airport.

Fix?

Bootstrapping RelationsIssues:

False alarms:Some matches aren’t valid relations

The catheter has a hub at the proximal end. A star topology often has a hub at the center.

Fix? /[ORG] has hub at [LOC]/Misses:

Some instances are narrowly missed by pattern No frills rival easyJet, which has established a hub at Liverpool. Ryanair also has a continental hub at Charleroi airport.

Fix? Relax patterns

Bootstrapping RelationsIssues:

False alarms:Some matches aren’t valid relations

The catheter has a hub at the proximal end. A star topology often has a hub at the center.

Fix? /[ORG] has hub at [LOC]/Misses:

Some instances are narrowly missed by pattern No frills rival easyJet, which has established a hub at Liverpool. Ryanair also has a continental hub at Charleroi airport.

Fix? Relax patterns – see issue 1

Bootstrapping Relations Issues:

False alarms:Some matches aren’t valid relations

The catheter has a hub at the proximal end. A star topology often has a hub at the center.

Fix? /[ORG] has hub at [LOC]/ Misses:

Some instances are narrowly missed by pattern No frills rival easyJet, which has established a hub at Liverpool. Ryanair also has a continental hub at Charleroi airport.

Fix? Relax patterns – see issue 1 Add new high precision patterns

BootstrappingHow can we obtain more high quality patterns?

BootstrappingHow can we obtain more high quality patterns?

Human labeling?

BootstrappingHow can we obtain more high quality patterns?

Human labeling? – see above…Use tuples identified by seed patternsFind other occurrences of tuplesExtract patterns from those occurrences

BootstrappingHow can we obtain more high quality patterns?

Human labeling? – see above…Use tuples identified by seed patternsFind other occurrences of tuplesExtract patterns from those occurrences

E.g. tuple: Ryanair, Charleroi and hub

BootstrappingHow can we obtain more high quality patterns?

Human labeling? – see above…Use tuples identified by seed patternsFind other occurrences of tuplesExtract patterns from those occurrences

E.g. tuple: Ryanair, Charleroi and hubOccurrences:

Budget airline Ryanair, which uses Charleroi as a hubAll flights in and out of Ryanair’s Belgian hub at Charleroi

Patterns:

BootstrappingHow can we obtain more high quality patterns?

Human labeling? – see above… Use tuples identified by seed patterns Find other occurrences of tuples Extract patterns from those occurrences

E.g. tuple: Ryanair, Charleroi and hub Occurrences:

Budget airline Ryanair, which uses Charleroi as a hubAll flights in and out of Ryanair’s Belgian hub at Charleroi

Patterns: /[ORG], which uses [LOC] as a hub /

BootstrappingHow can we obtain more high quality patterns?

Human labeling? – see above… Use tuples identified by seed patterns Find other occurrences of tuples Extract patterns from those occurrences

E.g. tuple: Ryanair, Charleroi and hub Occurrences:

Budget airline Ryanair, which uses Charleroi as a hubAll flights in and out of Ryanair’s Belgian hub at Charleroi

Patterns: /[ORG], which uses [LOC] as a hub / /[ORG]’s hub at [LOC] /

Bootstrapping IssuesIssues:

How do we represent patterns?

How to we assess the patterns?

How to we assess the tuples?

Pattern RepresentationGeneral approaches:

Pattern RepresentationGeneral approaches:

Regular expressionsSpecific, high precision

Pattern RepresentationGeneral approaches:

Regular expressionsSpecific, high precision

Machine learning style feature vectorsMore flexible, higher recall

Pattern RepresentationGeneral approaches:

Regular expressionsSpecific, high precision

Machine learning style feature vectorsMore flexible, higher recall

Components in representation:Context before first entity mentionContext between entity mentionsContext after last entity mentionOrder of arguments

Assessing Patterns & Tuples

Issue:

Assessing Patterns & Tuples

Issue: No gold standard!Can use seed patterns

Assessing Patterns & Tuples

Issue: No gold standard!Can use seed patterns

Balance two main factors:Performance on seed tuples

Hits and missesProductivity in full collection

Finds

Assessing Patterns & Tuples

Issue: No gold standard!Can use seed patterns

Balance two main factors:Performance on seed tuples

Hits and missesProductivity in full collection

Finds

Use conservative strategy to minimize semantic driftE.g. Sydney has a ferry hub at Circular Key.

Relation Analysis Evaluation

Approaches:Standardized, labeled data sets

Relation Analysis Evaluation

Approaches:Standardized, labeled data sets

Precision, recall, f-measure ‘Labeled’ precision: test both detection and classification ‘Unlabeled’ precision: test only detection

Relation Analysis Evaluation

Approaches:Standardized, labeled data sets

Precision, recall, f-measure ‘Labeled’ precision: test both detection and classification ‘Unlabeled’ precision: test only detection

No fully labeled corpus

Relation Analysis Evaluation

Approaches:Standardized, labeled data sets

Precision, recall, f-measure ‘Labeled’ precision: test both detection and classification ‘Unlabeled’ precision: test only detection

No fully labeled corpusEvaluate retrieved tuples (e.g. DB contents)

Intuition: Don’t care how many times we find same tuple

Relation Analysis Evaluation

Approaches:Standardized, labeled data sets

Precision, recall, f-measure ‘Labeled’ precision: test both detection and classification ‘Unlabeled’ precision: test only detection

No fully labeled corpusEvaluate retrieved tuples (e.g. DB contents)

Intuition: Don’t care how many times we find same tuple

Checked by human assessors

top related