December 2005 CSA3180: Information Extraction I 1

CSA3180: Natural Language Processing

Information Extraction 1 – Introduction
• Information Extraction
• Named Entities
• IE Systems
• MUC
• Finite State Machines
• Pattern Recognition


Introduction

• Slides based on Lectures by Marti Hearst (2004)

• GATE Information Extraction

• http://www.gate.ac.uk/ie/

• Sheffield Web Intelligence Technologies

• http://nlp.shef.ac.uk/wig/


Classification at different granularities

• Text Categorization:
  – Classify an entire document

• Information Extraction (IE):
  – Identify and classify small units within documents

• Named Entity Extraction (NE):
  – A subset of IE
  – Identify and classify proper names
    • People, locations, organizations


Martin Baker, a person

Genomics job

Employers job posting form


Aggregator Websites


foodscience.com-Job2

JobTitle: Ice Cream Guru

Employer: foodscience.com

JobCategory: Travel/Hospitality

JobFunction: Food Services

JobLocation: Upper Midwest

Contact Phone: 800-488-2611

DateExtracted: January 8, 2001

Source: www.foodscience.com/jobs_midwest.html

OtherCompanyJobs: foodscience.com-Job1


Aggregator Websites

• Read in many web pages from different sites

• Extract information into a database

• Screen Scraping

• Can then return data matching particular queries

• Data mining can extract meaningful insight that might not have been obvious


Data Mining


IE from Research Papers


IE from Commercial Websites


What is Information Extraction?

As a task: filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..

IE


What is Information Extraction?

As a family of techniques: Information Extraction = segmentation + classification + association

Segments found in the news article (aka “named entity extraction”):
Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation


IE in Context

Main pipeline: Document collection → Spider → Filter by relevance → IE (Segment, Classify, Associate, Cluster) → Load DB → Database → Query, Search; Data mine

Supporting steps: Create ontology; Label training data; Train extraction models


IE in Context: Formatting

Grammatical sentences and some formatting & links

Text paragraphs without formatting

Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.


IE in Context: Formatting

Non-grammatical snippets, rich formatting & links

Tables


IE in Context: Coverage

• Web site specific (formatting): Amazon.com Book Pages
• Genre specific (layout): Resumes
• Wide, non-specific (language): University Names


IE in Context: Complexity

• Closed set: U.S. states
  – He was born in Alabama…
  – The big Wyoming sky…

• Regular set: U.S. phone numbers
  – Phone: (413) 545-1323
  – The CALD main office can be reached at 412-268-1299

• Complex pattern: U.S. postal addresses
  – University of Arkansas, P.O. Box 140, Hope, AR 71802
  – Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210

• Ambiguous patterns, needing context and many sources of evidence: Person names
  – …was among the six houses sold by Hope Feldman that year.
  – Pawel Opalinski, Software Engineer at WhizBang Labs.
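The two easiest cases on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lexicon is deliberately partial (the four state names that appear in these slides), the phone pattern covers only the two formats shown, and the names `US_STATES`, `PHONE_RE`, `find_states` and `find_phones` are made up for the example.

```python
import re

# Closed set: membership test against a lexicon.
# Assumption: a partial lexicon; a real one would list all 50 states
# and would need to handle multi-word names such as "New York".
US_STATES = {"Alabama", "Alaska", "Wisconsin", "Wyoming"}

# Regular set: a regular expression is enough.
# Covers "(413) 545-1323" and "412-268-1299" only.
PHONE_RE = re.compile(r"\(?\d{3}\)?[ -]\d{3}-\d{4}")


def find_states(text):
    """Return every single-word token that is in the state lexicon."""
    return [tok for tok in re.findall(r"[A-Za-z]+", text) if tok in US_STATES]


def find_phones(text):
    """Return every substring that looks like a U.S. phone number."""
    return PHONE_RE.findall(text)


print(find_states("The big Wyoming sky"))
print(find_phones("The CALD main office can be reached at 412-268-1299"))
```

Postal addresses and person names do not reduce to a lookup or a single pattern like this, which is the point of the slide's last two rows.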


IE in Context: Single Field/Record

Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.

• Single entity (“named entity” extraction)
  – Person: Jack Welch
  – Person: Jeffrey Immelt
  – Location: Connecticut

• Binary relationship
  – Relation: Person-Title; Person: Jack Welch; Title: CEO
  – Relation: Company-Location; Company: General Electric; Location: Connecticut

• N-ary record
  – Relation: Succession; Company: General Electric; Title: CEO; Out: Jack Welch; In: Jeffrey Immelt


State of the Art

• Named entity recognition from newswire text
  – Person, Location, Organization, …
  – F1 in high 80’s or low- to mid-90’s

• Binary relation extraction
  – Contained-in (Location1, Location2), Member-of (Person1, Organization1)
  – F1 in 60’s or 70’s or 80’s

• Web site structure recognition
  – Extremely accurate performance obtainable
  – Human effort (~10min?) required on each site


IE Generations

• Hand-Built Systems – Knowledge Engineering [1980s– ]
  – Rules written by hand
  – Require experts who understand both the systems and the domain
  – Iterative guess-test-tweak-repeat cycle

• Automatic, Trainable Rule-Extraction Systems [1990s– ]
  – Rules discovered automatically using predefined templates, using automated rule learners
  – Require huge, labeled corpora (effort is just moved!)

• Statistical Models [1997 – ]
  – Use machine learning to learn which features indicate boundaries and types of entities.
  – Learning usually supervised; may be partially unsupervised


IE Techniques

Each technique is illustrated on the sentence: Abraham Lincoln was born in Kentucky.

• Lexicons: is a candidate a member of a list (Alabama, Alaska, …, Wisconsin, Wyoming)?
• Classify Pre-segmented Candidates: a classifier decides which class each candidate belongs to.
• Sliding Window: a classifier decides which class each window belongs to; try alternate window sizes.
• Boundary Models: a classifier decides which positions are BEGIN and END boundaries.
• Context Free Grammars: which is the most likely parse (NNP, V, P, NP, PP, VP, S)?
• Finite State Machines: which is the most likely state sequence?
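As a rough illustration of the sliding-window idea, the sketch below tries every window of one to `max_len` tokens and asks a "classifier" for its class. To keep it self-contained, the classifier here is just a lexicon lookup, which is an assumption of this example rather than what a trained system would use; the lexicon contents and the function name are likewise hypothetical.

```python
def sliding_window_matches(tokens, lexicon, max_len=3):
    """Return (start, end, phrase, label) for each window found in the lexicon.

    Windows of every size from 1 to max_len are tried at every position.
    The lookup stands in for a real classifier.
    """
    matches = []
    for size in range(1, max_len + 1):              # try alternate window sizes
        for start in range(len(tokens) - size + 1):
            phrase = " ".join(tokens[start:start + size])
            label = lexicon.get(phrase)             # "which class?"
            if label is not None:
                matches.append((start, start + size, phrase, label))
    return sorted(matches)                          # order by start position


tokens = "Abraham Lincoln was born in Kentucky".split()
# Hypothetical toy lexicon mapping phrases to classes.
lexicon = {"Abraham Lincoln": "PERSON", "Kentucky": "LOCATION"}

for match in sliding_window_matches(tokens, lexicon):
    print(match)
```

Replacing the dictionary lookup with a learned classifier over window features gives the trainable version of the same loop.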


Trainable IE Systems

Pros
• Annotating text is simpler & faster than writing rules.
• Domain independent
• Domain experts don’t need to be linguists or programmers.
• Learning algorithms ensure full coverage of examples.

Cons
• Hand-crafted systems perform better, especially at hard tasks (but this is changing).
• Training data might be expensive to acquire
• May need huge amount of training data
• Hand-writing rules isn’t that hard!!


MUC: Genesis of IE

• DARPA funded significant efforts in IE in the early to mid 1990’s.

• Message Understanding Conference (MUC) was an annual event/competition where results were presented.

• Focused on extracting information from news articles:– Terrorist events– Industrial joint ventures– Company management changes

• Information extraction of particular interest to the intelligence community (CIA, NSA). (Note: early ’90’s)


MUC

• Named entity
  – Person, Organization, Location
• Co-reference
  – Clinton ↔ President Bill Clinton
• Template element
  – Perpetrator, Target
• Template relation
  – Incident
• Multilingual


MUC Typical Text

Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month.


MUC Templates

• Relationship
  – tie-up
• Entities
  – Bridgestone Sports Co, a local concern, a Japanese trading house
• Joint venture company
  – Bridgestone Sports Taiwan Co
• Activity
  – ACTIVITY 1
• Amount
  – NT$20,000,000


MUC Templates

• ACTIVITY 1
  – Activity
    • Production
  – Company
    • Bridgestone Sports Taiwan Co
  – Product
    • Iron and “metal wood” clubs
  – Start Date
    • January 1990


Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month.

TIE-UP-1
Relationship: TIE-UP
Entities: “Bridgestone Sports Co.”, “a local concern”, “a Japanese trading house”
Joint Venture Company: “Bridgestone Sports Taiwan Co.”
Activity: ACTIVITY-1
Amount: NT$20000000

Example from FASTUS (1993)


ACTIVITY-1
Activity: PRODUCTION
Company: “Bridgestone Sports Taiwan Co.”
Product: “iron and ‘metal wood’ clubs”
Start Date: DURING: January 1990


Evaluating IE Accuracy

• Always evaluate performance on independent, manually-annotated test data not used during system development.

• Measure for each test document:
  – Total number of correct extractions in the solution template: N
  – Total number of slot/value pairs extracted by the system: E
  – Number of extracted slot/value pairs that are correct (i.e. in the solution template): C

• Compute average value of metrics adapted from IR:
  – Recall = C/N
  – Precision = C/E
  – F-Measure = Harmonic mean of recall and precision = 2 · Precision · Recall / (Precision + Recall)
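The three measures can be computed directly from the counts N, E and C defined above. The sketch below is a minimal illustration, assuming each template is represented as a set of (slot, value) pairs and that a pair counts as correct only on an exact match; the function name `score` and the sample system output are invented for the example.

```python
def score(solution, extracted):
    """Return (precision, recall, f_measure) for one test document.

    solution:  set of (slot, value) pairs in the manually built template
    extracted: set of (slot, value) pairs produced by the system
    """
    n = len(solution)                  # N: correct extractions in the solution
    e = len(extracted)                 # E: pairs extracted by the system
    c = len(solution & extracted)      # C: extracted pairs that are correct
    recall = c / n if n else 0.0
    precision = c / e if e else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


solution = {
    ("Relationship", "TIE-UP"),
    ("Joint Venture Company", "Bridgestone Sports Taiwan Co."),
    ("Activity", "PRODUCTION"),
    ("Start Date", "January 1990"),
}
# Hypothetical system output: two correct pairs, one wrong date.
extracted = {
    ("Relationship", "TIE-UP"),
    ("Activity", "PRODUCTION"),
    ("Start Date", "February 1990"),
}
print(score(solution, extracted))
```

Here N = 4, E = 3 and C = 2, so precision is 2/3, recall is 1/2 and the F-measure is 4/7. Averaging these values over all test documents gives the figures reported for a system.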


Results for 1997

NE – named entity recognition
CO – coreference resolution
TE – template element construction
TR – template relation construction
ST – scenario template production


Finite State Transducers for IE

• Basic method for extracting relevant information

• IE systems generally use a collection of specialized FSTs

  – Company Name detection
  – Person Name detection
  – Relationship detection


Equivalent Representations

Finite automata, regular expressions and regular languages: each can describe the others.

Theorem:

For every regular expression, there is a deterministic finite-state automaton that defines the same language, and vice versa.


Finite State Automata Graphs

• The start state

• An accepting state

• A transition (labeled a)

• A state


Finite Automata

• A FA is similar to a compiler in that:
  – A compiler recognizes legal programs in some (source) language.
  – A finite-state machine recognizes legal strings in some language.

• Example: Programming Language Identifiers
  – sequences of one or more letters or digits, starting with a letter:

  S –letter→ A, with a loop on A labeled letter | digit (A is the accepting state)


Finite Automata

• Transition: s1 –a→ s2
• Is read: in state s1 on input “a” go to state s2
• If end of input
  – If in accepting state => accept
  – Otherwise => reject
• If no transition possible (got stuck) => reject
• FSA = Finite State Automaton


Finite State Automata Language

• The language defined by a FSA is the set of strings accepted by the FSA.
  – in the language of the FSM shown below:
    • x, tmp2, XyZzy, position27.
  – not in the language of the FSM shown below:
    • 123, a?, 13apples.

  S –letter→ A, with a loop on A labeled letter | digit (A is the accepting state)
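The identifier machine is small enough to write out directly. The sketch below is one possible encoding, not the only one: transitions are stored in a dictionary keyed by (state, input class), and the helper names `char_class`, `TRANSITIONS` and `accepts` are chosen for this example. Letters and digits are restricted to ASCII, which is an assumption.

```python
def char_class(ch):
    """Map a character to the input class used on the FSA's edges."""
    if ch.isascii() and ch.isalpha():
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "other"


# S --letter--> A ; A --letter|digit--> A
TRANSITIONS = {
    ("S", "letter"): "A",
    ("A", "letter"): "A",
    ("A", "digit"): "A",
}
START = "S"
ACCEPTING = {"A"}


def accepts(text):
    """Run the FSA over text; reject if it gets stuck or ends in a non-accepting state."""
    state = START
    for ch in text:
        state = TRANSITIONS.get((state, char_class(ch)))
        if state is None:          # no transition possible: stuck, so reject
            return False
    return state in ACCEPTING


for s in ["x", "tmp2", "XyZzy", "position27", "123", "a?", "13apples"]:
    print(s, accepts(s))
```

The first four strings are accepted and the last three are rejected, matching the lists on the slide.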


Example: Integer Literals

• FSA that accepts integer literals with an optional + or - sign:
• Note the two different edges from S to A
• (+|-)?[0-9]+

  S –+→ A, S –-→ A, S –digit→ B, A –digit→ B, B –digit→ B (B is the accepting state)
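The regular expression on this slide can be tried out with Python's `re` module. One detail changes in the translation: a bare `+` is a repetition operator in Python's regex syntax, so the sign alternatives are written here as the character class `[+-]`. The function name `is_integer_literal` is just for this example.

```python
import re

# Slide notation: (+|-)?[0-9]+
# In Python the two signs are written as the character class [+-].
INTEGER_RE = re.compile(r"[+-]?[0-9]+")


def is_integer_literal(text):
    """True if the whole string is an optionally signed run of digits."""
    # fullmatch requires the entire input to match, mirroring the FSA rule
    # "accept only if the input ends in an accepting state".
    return INTEGER_RE.fullmatch(text) is not None


for s in ["42", "+7", "-123", "+", "4.2", ""]:
    print(repr(s), is_integer_literal(s))
```

Using `fullmatch` rather than `search` matters here: `search` would find the digits inside "4.2" and report a match, which is not what acceptance by the automaton means.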


Finite State Automata Example

• FSA that accepts three letter English words that begin with p and end with d or t.

• Here I use the convenient notation of making the state name match the input that has to be on the edge leading to that state.

[Diagram: a start state leading to state p, middle states a, i, o and u, and final states d and t]


Formal Definition

• A finite automaton is a 5-tuple (Σ, Q, δ, q, F) where:
  – An input alphabet Σ
  – A set of states Q
  – A start state q
  – A set of accepting states F ⊆ Q
  – δ is the state transition function: Q × Σ → Q (i.e., encodes transitions state –input→ state)


FSA Implementation

A table-driven approach:
• table:
  – one row for each state in the machine, and
  – one column for each possible character.
• Table[j][k]
  – which state to go to from state j on character k,
  – an empty entry corresponds to the machine getting stuck.

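• Note: when you use the re package in Python, it compiles your regex’s into an internal matcher program. That matcher uses backtracking rather than a pure table-driven FSM, but for simple patterns like the ones in these slides it accepts the same strings.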
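A table-driven recognizer for signed integer literals might look like the sketch below. It departs from the slide in one labeled way: to keep the table small, columns stand for character classes (sign, digit) rather than one column per possible character. `None` plays the role of the empty entry, and the state and column numbering is an assumption of this example.

```python
# Row indices (states) and column indices (character classes).
S, A, B = 0, 1, 2
SIGN, DIGIT = 0, 1

# TABLE[j][k]: next state from state j on class k; None means "stuck".
TABLE = [
    # SIGN  DIGIT
    [A,     B],     # S: start state
    [None,  B],     # A: a sign has been read
    [None,  B],     # B: at least one digit has been read
]
ACCEPTING = {B}


def column(ch):
    """Map a character to its table column, or None if it has no column."""
    if ch in "+-":
        return SIGN
    if ch in "0123456789":
        return DIGIT
    return None


def run(text):
    """Drive the table over text and report acceptance."""
    state = S
    for ch in text:
        k = column(ch)
        if k is None:
            return False              # character outside the alphabet
        state = TABLE[state][k]
        if state is None:
            return False              # empty entry: the machine is stuck
    return state in ACCEPTING


print(run("-123"), run("+"), run("12a"))
```

The driver loop never changes; recognizing a different language only means swapping in a different table and accepting set.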


Terminology

• FSA: Finite State Automaton

• FSM: Finite State Machine
  – FSA, FSM used interchangeably

• FST: Finite State Transducer
  – The same as FSA, except each transition produces output


FSTs for IE

• FSTs are often compiled from regular expressions
• Probabilistic (weighted) FSTs
• FSTs mean different things to different IE approaches:
  – Based on lexical items (words)
  – Based on statistical language models
  – Based on deep syntactic/semantic analysis
• Several FSTs or a more complex FST can be used to find one type of information (e.g. company names)


FSTs for IE

Frodo Baggins works for Hobbit Factory, Inc.

Text Analyzer:

Frodo – Proper Name

Baggins – Proper Name

works – Verb

for – Prep

Hobbit – UnknownCap

Factory – NounCap

Inc – CompAbbr


FSTs for IE

Frodo Baggins works for Hobbit Factory, Inc.

A regular expression for finding company names: “some capitalized words, maybe a comma, then a company abbreviation indicator”

CompanyName = (ProperName | SomeCap)+ Comma? CompAbbr
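One way to run a pattern like this over tags instead of characters is to encode each tag as a single letter and apply an ordinary regex to the resulting string. The sketch below does that under several assumptions: the tagged input is written by hand, the capitalized-word tags are all labeled SomeCap, and the one-letter codes and the name `find_company_names` are invented for the example.

```python
import re

# One-letter code per tag; any other tag becomes "o".
TAG_CODES = {"ProperName": "P", "SomeCap": "C", "Comma": ",", "CompAbbr": "A"}

# (ProperName | SomeCap)+ Comma? CompAbbr, written over the tag codes.
COMPANY_RE = re.compile(r"[PC]+,?A")


def find_company_names(tagged):
    """tagged: list of (word, tag) pairs. Return the matched company names."""
    codes = "".join(TAG_CODES.get(tag, "o") for _, tag in tagged)
    names = []
    for m in COMPANY_RE.finditer(codes):
        # One code per token, so match offsets index straight into tagged.
        words = [word for word, _ in tagged[m.start():m.end()]]
        names.append(" ".join(words).replace(" ,", ","))
    return names


tagged = [
    ("Frodo", "ProperName"), ("Baggins", "ProperName"),
    ("works", "Verb"), ("for", "Prep"),
    ("Hobbit", "SomeCap"), ("Factory", "SomeCap"),
    (",", "Comma"), ("Inc", "CompAbbr"),
]
print(find_company_names(tagged))
```

"Frodo Baggins" is not returned even though it is a run of proper names, because no company abbreviation follows it.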


FSTs for IE

Frodo Baggins works for Hobbit Factory, Inc.

ProperName works for CompanyName.


FSTs for IE

Frodo Baggins works for Hobbit Factory, Inc.

[Diagram: Company Name Detection FSA with states 1 to 4 and edges labeled word, (CAP | PN), comma, CAB and ε]

CAP = SomeCap, CAB = CompAbbr, PN = ProperName, ε = empty string


FSTs for IE

Frodo Baggins works for Hobbit Factory, Inc.

[Diagram: Company Name Detection FST with states 1 to 4; the word edges output the word unchanged and the CAB edges output CN]

CAP = SomeCap, CAB = CompAbbr, PN = ProperName, ε = empty string, CN = CompanyName
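The exact edges of the diagram did not survive in this transcript, so the following is only a sketch of how such a transducer could behave, with the transition structure assumed: ordinary words are copied to the output, capitalized words are held back while a company name is still possible, and the whole held span is replaced by CN once a company abbreviation arrives. The function name `transduce` and the flush-on-failure behavior are choices made for this example.

```python
CAP_OR_PN = {"SomeCap", "ProperName"}


def transduce(tagged):
    """Copy words to the output, replacing each company-name span with 'CN'.

    tagged: list of (word, tag) pairs.
    States: 1 = outside a candidate, 2 = inside capitalized words,
            3 = just after a comma that followed capitalized words.
    State 4 of the slide (name complete) is folded into the return to state 1.
    """
    output, held, state = [], [], 1
    for word, tag in tagged:
        if state in (2, 3) and tag == "CompAbbr":
            output.append("CN")        # name complete: emit CN, drop held words
            held, state = [], 1
        elif state in (1, 2) and tag in CAP_OR_PN:
            held.append(word)          # possible start or continuation of a name
            state = 2
        elif state == 2 and tag == "Comma":
            held.append(word)
            state = 3
        else:
            output.extend(held)        # candidate failed: release held words
            held = []
            if tag in CAP_OR_PN:
                held.append(word)      # this word may start a new candidate
                state = 2
            else:
                output.append(word)
                state = 1
    output.extend(held)                # release anything still held at the end
    return output


tagged = [
    ("Frodo", "ProperName"), ("Baggins", "ProperName"),
    ("works", "Verb"), ("for", "Prep"),
    ("Hobbit", "SomeCap"), ("Factory", "SomeCap"),
    (",", "Comma"), ("Inc", "CompAbbr"),
]
print(transduce(tagged))
```

A separate person-name transducer would then turn "Frodo Baggins" into a single ProperName token, giving the "ProperName works for CompanyName" form shown earlier.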


FSMs for Pattern Recognition

• Use regex’s at a higher level of description for pattern recognition

• Determining which person holds what office in what organization
  – [person] , [office] of [org]
    • Vuk Draskovic, leader of the Serbian Renewal Movement
  – [org] (named, appointed, etc.) [person] P [office]
    • NATO appointed Wesley Clark as Commander in Chief

• Determining where an organization is located
  – [org] in [loc]
    • NATO headquarters in Brussels
  – [org] [loc] (division, branch, headquarters, etc.)
    • KFOR Kosovo headquarters
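Patterns at this level assume that earlier stages have already marked the entities. A minimal sketch, assuming a made-up inline markup of the form `[type text]` produced by those earlier stages, shows how the first pattern could be written with named groups; the markup, the pattern name and the function name are all hypothetical.

```python
import re

# Assumed upstream markup: [person ...], [office ...], [org ...].
# Pattern: [person] , [office] of [org]
PERSON_OFFICE_ORG = re.compile(
    r"\[person (?P<person>[^\]]+)\], "
    r"\[office (?P<office>[^\]]+)\] of "
    r"\[org (?P<org>[^\]]+)\]"
)


def extract_office_holders(annotated):
    """Return one dict per match with keys person, office and org."""
    return [m.groupdict() for m in PERSON_OFFICE_ORG.finditer(annotated)]


annotated = (
    "[person Vuk Draskovic], [office leader] of "
    "[org the Serbian Renewal Movement]"
)
print(extract_office_holders(annotated))
```

The other patterns on the slide would be further regexes of the same shape, each filling a different relation.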


FASTUS

• Successful early IE system (1993)

• Built on Finite State Automata (FSA) transductions


1. Complex Words: Recognition of multi-words and proper names
   • set up
   • new Taiwan dollars

2. Basic Phrases: Simple noun groups, verb groups and particles
   • a Japanese trading house
   • had set up

3. Complex Phrases: Complex noun groups and verb groups
   • production of 20,000 iron and metal wood clubs

4. Domain Events: Patterns for events of interest to the application. Basic templates are to be built.
   • [company] [set up] [Joint-Venture] with [company]

5. Merging Structures: Templates from different parts of the texts are merged if they provide information about the same entity or event.