relation extraction

107
Relation Relation Extraction Extraction Slides from Dan Jurafsky, Rion Slides from Dan Jurafsky, Rion Snow, Jim Martin, Chris Manning Snow, Jim Martin, Chris Manning and William Cohen and William Cohen

Upload: kelli

Post on 21-Jan-2016

43 views

Category:

Documents


0 download

DESCRIPTION

Relation Extraction. Slides from Dan Jurafsky , Rion Snow, Jim Martin, Chris Manning and William Cohen. Background: Information Extraction. Extract information from text Sometimes called text analytics commercially - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Relation Extraction

Relation ExtractionRelation Extraction

Slides from Dan Jurafsky, Rion Snow, Slides from Dan Jurafsky, Rion Snow, Jim Martin, Chris Manning and William Jim Martin, Chris Manning and William

CohenCohen

Page 2: Relation Extraction

Background: Information Background: Information ExtractionExtraction

Extract information from textExtract information from text Sometimes called text analytics

commercially Extract entities (the people,

organizations, locations, times, dates, genes, diseases, medicines, etc. in a text)

Extract the relations between entities Figure out the larger events that are

taking place

Page 3: Relation Extraction

Information ExtractionInformation Extraction

Creating knowledge bases and ontologiesCreating knowledge bases and ontologies Implications for cognitive modelingImplications for cognitive modeling Digital LibariesDigital Libaries

Google scholar, Citeseer need to extract the title, author and references

BioinformaticsBioinformatics Patent analysisPatent analysis Specific market segments for stock analysisSpecific market segments for stock analysis SEC filingsSEC filings Intelligence analysisIntelligence analysis

Page 4: Relation Extraction

Paradigms: Data Mining vs Text Paradigms: Data Mining vs Text MiningMining

Data Mining Data Mining is a set of techniques that aims to is a set of techniques that aims to extract knowledge by big data analysis, e.g. from extract knowledge by big data analysis, e.g. from databases, using simple patterns and statistical databases, using simple patterns and statistical methodsmethods

In contrast, the In contrast, the Text MiningText Mining 's goal (or 's goal (or Machine Machine ReadingReading),),

is to extract relationship between entities that just is to extract relationship between entities that just existing in examinated existing in examinated text, text, semantically deeper semantically deeper than data mining doesthan data mining does

The difference is in the source information:The difference is in the source information: relationship in text just exist in our source, we don't need to try to guess it but only discover it!

Page 5: Relation Extraction

OutlineOutline

Reminder: Named Entity TaggingReminder: Named Entity Tagging Relation ExtractionRelation Extraction

Hand-built patterns Seed (bootstrap) methods Supervised classification Distant supervision

Page 6: Relation Extraction

What is “Information What is “Information Extraction”Extraction”

Slide from William Cohen

Filling slots in a database from sub-segments of text.As a task:

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME TITLE ORGANIZATION

Page 7: Relation Extraction

What is “Information What is “Information Extraction”Extraction”

Slide from William Cohen

Filling slots in a database from sub-segments of text.As a task:

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME TITLE ORGANIZATIONBill Gates CEO MicrosoftBill Veghte VP MicrosoftRichard Stallman founder Free Soft..

IE

Page 8: Relation Extraction

What is “Information What is “Information Extraction”Extraction”

Slide from William Cohen

Information Extraction = segmentation + classification + clustering + association

As a familyof techniques:

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

Microsoft CorporationCEOBill GatesMicrosoftGatesMicrosoftBill VeghteMicrosoftVPRichard StallmanfounderFree Software Foundation

“named entity extraction”

Page 9: Relation Extraction

What is “Information What is “Information Extraction”Extraction”

Slide from William Cohen

Information Extraction = segmentation + classification + association + clustering

As a familyof techniques:

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

Microsoft CorporationCEOBill GatesMicrosoftGatesMicrosoftBill VeghteMicrosoftVPRichard StallmanfounderFree Software Foundation

Page 10: Relation Extraction

What is “Information What is “Information Extraction”Extraction”

Slide from William Cohen

Information Extraction = segmentation + classification + association + clustering

As a familyof techniques:

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

Microsoft CorporationCEOBill GatesMicrosoftGatesMicrosoftBill VeghteMicrosoftVPRichard StallmanfounderFree Software Foundation

Page 11: Relation Extraction

What is “Information What is “Information Extraction”Extraction”

Slide from William Cohen

Information Extraction = segmentation + classification + association + clustering

As a familyof techniques:

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

Microsoft CorporationCEOBill GatesMicrosoftGatesMicrosoftBill VeghteMicrosoftVPRichard StallmanfounderFree Software Foundation N

AME

TITLE ORGANIZATION

Bill Gates

CEO

Microsoft

Bill Veghte

VP

Microsoft

Richard Stallman

founder

Free Soft..

*

*

*

*

Page 12: Relation Extraction

Extracting Structured Extracting Structured KnowledgeKnowledge

LLNL EQ Lawrence Livermore National Laboratory LLNL LOC-IN CaliforniaLivermore LOC-IN CaliforniaLLNL IS-A scientific research laboratoryLLNL FOUNDED-BY University of CaliforniaLLNL FOUNDED-IN 1952

Each article can contain hundreds or thousands of items of knowledge...

“The Lawrence Livermore National Laboratory (LLNL) in Livermore, California is a scientific research

laboratory founded by the University of California in 1952.”

Page 13: Relation Extraction

Goal: Machine-readable summariesGoal: Machine-readable summaries

SubjectSubject RelationRelation ObjectObject

p53 is_a protein

Bax is_a protein

p53has_functio

napoptosis

Baxhas_functio

ninduction

apoptosis involved_in cell_death

Bax is_in mitochondrialouter membrane

Bax is_in cytoplasm

apoptosis related_to caspase activation

... ... ...

Textual abstract: Summary for human

Structured knowledge extraction: Summary for

machine

Page 14: Relation Extraction

From Unstructured Text to Structured Knowledge

Unstructured Text

News articles...slide from Rion Snow

Page 15: Relation Extraction

From Unstructured Text to Structured Knowledge

Unstructured Text

Blog posts....slide from Rion Snow

Page 16: Relation Extraction

From Unstructured Text to Structured Knowledge

Unstructured Text

Scientific journal articles...slide from Rion Snow

Page 17: Relation Extraction

From Unstructured Text to Structured Knowledge

Unstructured Text

Tweets, instant messages, chat logs...slide from Rion Snow

Page 18: Relation Extraction

From Unstructured Text to Structured Knowledge

Unstructured Text

slide from Rion Snow

Page 19: Relation Extraction

From Unstructured Text to Structured Knowledge

Structured KnowledgeUnstructured Text

slide from Rion Snow

Page 20: Relation Extraction

From Unstructured Text to Structured Knowledge

Structured KnowledgeUnstructured Text

slide from Rion Snow

Page 21: Relation Extraction

From Unstructured Text to Structured Knowledge

Structured KnowledgeUnstructured Text

slide from Rion Snow

Page 22: Relation Extraction

From Unstructured Text to Structured Knowledge

Structured KnowledgeUnstructured Text

slide from Rion Snow

Page 23: Relation Extraction

From Unstructured Text to Structured Knowledge

Structured KnowledgeUnstructured Text

slide from Rion Snow

Page 24: Relation Extraction

From Unstructured Text to Structured Knowledge

Structured KnowledgeUnstructured Text

slide from Rion Snow

Page 25: Relation Extraction

Relation ExtractionRelation Extraction

What is relation extraction?What is relation extraction? Founded in 1801 as South Carolina College, Founded in 1801 as South Carolina College,

USC is the USC is the flagship institution of the institution of the University of South Carolina System and offers and offers more than 350 programs of study leading to more than 350 programs of study leading to bachelor's, , master's, and , and doctoral degrees from degrees from fourteen degree-granting colleges and schools to fourteen degree-granting colleges and schools to an enrollment of approximately 45,251 students, an enrollment of approximately 45,251 students, 30,967 on the main Columbia campus. … [wiki]30,967 on the main Columbia campus. … [wiki]

complex relation = summarizationcomplex relation = summarization focus on binary relation predicate(subject, object) focus on binary relation predicate(subject, object)

or triples <subj predicate obj>or triples <subj predicate obj>

Page 26: Relation Extraction

Wiki Info Box – Wiki Info Box – structured datastructured data

template template • standard things about standard things about

UniversitiesUniversities• EstablishedEstablished• typetype• facultyfaculty• studentsstudents• locationlocation• mascotmascot

Page 27: Relation Extraction

Focus on extracting binary Focus on extracting binary relationsrelations

• predicate(subject, object) from predicate logicpredicate(subject, object) from predicate logic

• triples <subj relation object>triples <subj relation object>

• Directed graphsDirected graphs

Page 28: Relation Extraction

Why relation extraction?Why relation extraction?

create new structured KBcreate new structured KB Augmenting existing: words -> wordnet, facts -> Augmenting existing: words -> wordnet, facts ->

FreeBase or DBPediaFreeBase or DBPedia Support question answering: Jeopardy Support question answering: Jeopardy Which relationsWhich relations

Automated Content Extraction (ACE) Automated Content Extraction (ACE) http://www.itl.nist.gov/iad/mig//tests/ace/http://www.itl.nist.gov/iad/mig//tests/ace/

17 relations17 relations

ACE examplesACE examples

Page 29: Relation Extraction

Unified Medical Language Unified Medical Language System (UMLS)System (UMLS)

UMLS: Unified Medical 134 entities, 54 UMLS: Unified Medical 134 entities, 54 relationsrelations

http://www.nlm.nih.gov/research/umls/

Page 30: Relation Extraction

UMLS semantic networkUMLS semantic network

Page 31: Relation Extraction

Current Relations in the Current Relations in the UMLS Semantic NetworkUMLS Semantic Network  isa isa     associated_with     associated_with         physically_related_to         physically_related_to             part_of             part_of             consists_of             consists_of             contains             contains             connected_to             connected_to             interconnects             interconnects             branch_of             branch_of             tributary_of             tributary_of             ingredient_of             ingredient_of         spatially_related_to         spatially_related_to             location_of             location_of             adjacent_to             adjacent_to             surrounds             surrounds             traverses             traverses         functionally_related_to         functionally_related_to             affects             affects                  …                  …                                 

……temporally_related_to temporally_related_to              co-occurs_with              co-occurs_with              precedes             precedesconceptually_related_toconceptually_related_toevaluation_of evaluation_of              degree_of              degree_of              analyzes              analyzes                  assesses_effect_of                  assesses_effect_of              measurement_of              measurement_of              measures              measures              diagnoses              diagnoses              property_of              property_of              derivative_of              derivative_of              developmental_form_of              developmental_form_of              method_of              method_of              …             …

Page 32: Relation Extraction

Databases of Wikipedia Databases of Wikipedia RelationsRelations

• DBpedia is a crowd-sourced community effortDBpedia is a crowd-sourced community effort• to extract structured information from to extract structured information from

WikipediaWikipedia• and to make this information readily availableand to make this information readily available• DBpedia allows you to make sophisticated DBpedia allows you to make sophisticated

queriesqueries

http://dbpedia.org/About

Page 33: Relation Extraction

English version of the English version of the DBpedia knowledge baseDBpedia knowledge base• 3.77 million things3.77 million things• 2.35 million are classified in an ontology2.35 million are classified in an ontology• including:including:

• including 764,000 persons, • 573,000 places (including 387,000 populated

places), • 333,000 creative works (including 112,000 music

albums, 72,000 films and 18,000 video games), • 192,000 organizations (including 45,000

companies and 42,000 educational institutions), • 202,000 species and • 5,500 diseases.

Page 34: Relation Extraction

freebasefreebase

google (freebase wiki) google (freebase wiki) http://wiki.freebase.com/wiki/Main_Pagehttp://wiki.freebase.com/wiki/Main_Page

Page 35: Relation Extraction

Ontological relationsOntological relations

Ontological relationsOntological relations• IS-A hypernymIS-A hypernym• Instance-ofInstance-of• has-Parthas-Part• hyponym (opposite of hypernym)hyponym (opposite of hypernym)

Page 36: Relation Extraction

Reminder: Task 1: Named Reminder: Task 1: Named Entity TaggingEntity Tagging

Slide from Chris Manning

General NER or Biomedical NER

<PER> John Hennessy</PER> is a professor at <ORG> Stanford University </ORG>, in <LOC> Palo Alto </LOC>.

<RNA> TAR </RNA> independent transactivation by <PROTEIN> Tat </PROTEIN> in cells derived from the <CELL> CNS </CELL> - a novel mechanism of <DNA> HIV-1 gene </DNA> regulation.

Page 37: Relation Extraction

Reminder: Maximum Reminder: Maximum Entropy Markov ModelEntropy Markov Model

K

kjk

m

jj

j

m

jj

thf

thf

htP

1 1

1

)),(exp(

)),(exp(

)|(

Slide from Chris Manning

DNAO DNA

of HIV−1 gene regulation

O

Page 38: Relation Extraction

Task II: Relation ExtractionTask II: Relation Extraction

Page 39: Relation Extraction

Relations between wordsRelations between words

Language Understanding Applications needs word Language Understanding Applications needs word meaning!meaning! Question answering Conversational agents Summarization

One key meaning component: word relationsOne key meaning component: word relations Hierarchical (ontological) relations

• “San Francisco” ISA “city” Other relations between words

• “alternator” is a part of a “car”

Page 40: Relation Extraction

Relation PredictionRelation Prediction

How can we capture the variability of expression of a relation in natural text from a large, unannotated corpus?

“...works by such authors as Herrick, Goldsmith, and Shakespeare.”

“If you consider authors like Shakespeare...”

“Shakespeare, author of The Tempest...”

“Some authors (including Shakespeare)...”

“Shakespeare was the author of several...”

Shakespeare IS-A author (0.87)

Page 41: Relation Extraction

HyponymyHyponymy

One sense is a One sense is a hyponymhyponym of another if the first of another if the first sense is more specific, denoting a subclass of the sense is more specific, denoting a subclass of the otherother car is a hyponym of vehicle dog is a hyponym of animal mango is a hyponym of fruit

ConverselyConversely vehicle is a hypernym/superordinate of car animal is a hypernym of dog fruit is a hypernym of mango

superordinate

vehicle fruit furniture mammal

hyponym car mango chair dog

Page 42: Relation Extraction

42

X is-a-part-of Y(meronym / holonym)

X is-a-kind-of Y(hyponym / hypernym)

WordNet relations

Page 43: Relation Extraction

WordNet is incompleteWordNet is incomplete

In WordNet 2.1 Not in WordNet“insulin”“progesterone”

“leptin”“pregnenolone”

“combustibility”“navigability

“affordability”“reusability”

“HTML” “XML”“Google”, “Yahoo” “Microsoft”, “IBM”

Ontological relations are missing for many words

Especially true for specific domains (restaurants, auto parts, finance)

Page 44: Relation Extraction

Other kinds of Relations: Disease Other kinds of Relations: Disease OutbreaksOutbreaks

Extract structured information from textExtract structured information from text

Slide from Eugene Agichtein

May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire , is finding itself hard pressed to cope with the crisis…

Date Disease Name Location

Jan. 1995 Malaria Ethiopia

July 1995 Mad Cow Disease U.K.

Feb. 1995 Pneumonia U.S.

May 1995 Ebola Zaire

Information Extraction System (e.g., NYU’s Proteus)

Disease Outbreaks in The New York Times

Page 45: Relation Extraction

More relations: Protein More relations: Protein InteractionsInteractions

„We show that CBF-A and CBF-C interact with each other to form a CBF-A-CBF-C complexand that CBF-B does not interact with CBF-A or CBF-C individually but that it associates with the CBF-A-CBF-C complex.“

CBF-A CBF-C

CBF-B CBF-A-CBF-C complex

interactcomplex

associates

Slide from Rosario and Hearst

Page 46: Relation Extraction

Yet More RelationsYet More Relations

CHICAGO (AP) — Citing high fuel prices, United Airlines said CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. some cities also served by lower-cost carriers. American American Airlines, a unit AMRAirlines, a unit AMR, immediately matched the move, , immediately matched the move, spokesman Tim Wagnerspokesman Tim Wagner said. said. United, a unit of UALUnited, a unit of UAL, said the , said the increase took effect Thursday night and applies to most routes increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as where it competes against discount carriers, such as Chicago to Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles Dallas and Atlanta and Denver to San Francisco, Los Angeles and New Yorkand New York

Slide from Jim Martin

Page 47: Relation Extraction

Relation TypesRelation Types

For generic news texts...For generic news texts...

Slide from Jim Martin

Page 48: Relation Extraction

More relations: UMLSMore relations: UMLS

Unified Medical Language SystemUnified Medical Language System integrates linguistic, terminological and semantic information Semantic Network consists of 134 semantic types and 54

relations between types

Pharmacologic Substance affects Pathologic FunctionPharmacologic Substance causes Pathologic FunctionPharmacologic Substance complicates Pathologic FunctionPharmacologic Substance diagnoses Pathologic FunctionPharmacologic Substance prevents Pathologic FunctionPharmacologic Substance treats Pathologic Function

Slide from Paul Buitelaar

Page 49: Relation Extraction

Relations in Ontologies: GO (Gene Relations in Ontologies: GO (Gene Ontology)Ontology) GO (Gene Ontology)GO (Gene Ontology)

Aligns descriptions of gene products in different databases, including plant, animal and microbial genomes

Organizing principles are molecular function, biological process and cellular component

Accession: GO:0009292Ontology: biological processSynonyms: broad: genetic exchangeDefinition: In the absence of a sexual life cycle, the processes

involved in the introduction of genetic information to create a genetically different individual.

Term Lineage all : all (164142)GO:0008150 : biological process (115947)

GO:0007275 : development (11892)GO:0009292 : genetic transfer (69)

Slide from Paul Buitelaar

Page 50: Relation Extraction

Relations in Ontologies: Relations in Ontologies: geographicalgeographical

OntologyF-Logic

similar

city

NeckarZugspitze

Geographical Entity (GE)

Natural GE Inhabited GE

countryrivermountain

instance_of

Germany

BerlinStuttgart

is-a

flow_through

located_in

capital_of

flow_through

flow_through

located_in

capital_of

367

length (km)

2962

height (m)

Design: Philipp CimianoSlide from Paul Buitelaar

Page 51: Relation Extraction

MeSH (Medical Subject MeSH (Medical Subject Headings) ThesaurusHeadings) Thesaurus

51

MeSH Descriptor Definition

Synonym set

Slide from Illhoi Yoo, Xiaohua (Tony) Hu,and Il-Yeol Song

Page 52: Relation Extraction

MeSH TreeMeSH TreeMeSH OntologyMeSH Ontology

Hierarchically arranged from most general to most specific.

Actually a graph rather than a tree

• normally appear in more than one place in the tree

MeSH Tree

Slide from Yoo, Hu, Song

Page 53: Relation Extraction

Types of ACE Relations, Types of ACE Relations, 20032003

ROLE ROLE - relates a person to an organization or a - relates a person to an organization or a geopolitical entitygeopolitical entity Subtypes: member, owner, affiliate, client, citizen

PART PART - generalized containment- generalized containment Subtypes: subsidiary, physical part-of, set membership

AT AT - permanent and transient locations- permanent and transient locations Subtypes: located, based-in, residence

SOCIALSOCIAL- social relations among persons- social relations among persons Subtypes: parent, sibling, spouse, grandparent,

associate

Slide from Doug Appelt

Page 54: Relation Extraction

Frequent Freebase Frequent Freebase RelationsRelations

aa

Page 55: Relation Extraction

Predicting the “is-a” Predicting the “is-a” relationrelation

How can we capture the variability of expression of a relation in natural text from a large, unannotated corpus?

“...works by such authors as Herrick, Goldsmith, and Shakespeare.”

“If you consider authors like Shakespeare...”

“Shakespeare, author of The Tempest...”

“Some authors (including Shakespeare)...”

“Shakespeare was the author of several...”

Shakespeare IS-A author (0.87)

Page 56: Relation Extraction

Why this is hard: Why this is hard: Ambiguity!Ambiguity!

Treatment Disease

Cure?

Prevent?

Side Effect?

Which relations hold between 2 Which relations hold between 2 entities?entities?

Page 57: Relation Extraction

Different relations between Disease Different relations between Disease (Hepatitis) and Treatment(Hepatitis) and Treatment

CureCure These results suggest that con A-induced

hepatitis was ameliorated by pretreatment with TJ-135.

PreventPrevent A two-dose combined hepatitis A and B vaccine

would facilitate immunization programsVagueVague

Effect of interferon on hepatitis B

Slide from B. Rosario and M. Hearst

Page 58: Relation Extraction

5 easy methods for relation 5 easy methods for relation extractionextraction

1.1. Hand-built patternsHand-built patterns2.2. Bootstrapping (seed) methodsBootstrapping (seed) methods

3.3. Supervised methodsSupervised methods

4.4. Unsupervised methodsUnsupervised methods5.5. Distant supervisionDistant supervision

Page 59: Relation Extraction

5 easy methods for relation 5 easy methods for relation extractionextraction

1.1. Hand-built patternsHand-built patterns2.2. Supervised methodsSupervised methods3.3. Bootstrapping (seed) methodsBootstrapping (seed) methods

4.4. Unsupervised methodsUnsupervised methods5.5. Distant supervisionDistant supervision

Page 60: Relation Extraction

A complex hand-built extraction A complex hand-built extraction rule [NYU Proteus]rule [NYU Proteus]

Page 61: Relation Extraction

Goal: Add hyponyms to Goal: Add hyponyms to WordNet directly from text.WordNet directly from text.

Intuition from Intuition from Hearst (1992) Hearst (1992) “Agar is a substance prepared from a mixture of

red algae, such as Gelidium, for laboratory or industrial use”

What does What does GelidiumGelidium mean? mean? How do you know?How do you know?

Page 62: Relation Extraction

Goal: Add hyponyms to WordNet Goal: Add hyponyms to WordNet directly from text.directly from text.

Intuition from Intuition from Hearst (1992) Hearst (1992) “Agar is a substance prepared from a mixture of

red algae, such as Gelidium, for laboratory or industrial use”

What does What does GelidiumGelidium mean? mean? How do you know?How do you know?

Page 63: Relation Extraction

Hearst’s Hand-Designed Lexico-Hearst’s Hand-Designed Lexico-Syntactic PatternsSyntactic Patterns

(Hearst, 1992): Automatic Acquisition of Hyponyms

“Y such as X ((, X)* (, and/or) X)”“such Y as X…”“X… or other Y”“X… and other Y”“Y including X…”“Y, especially X…”

Page 64: Relation Extraction

Hearst’s hand-built patterns for Hearst’s hand-built patterns for Relation ExtractionRelation Extraction

Hearst pattern Example occurrencesX and other Y ...temples, treasuries, and other important civic

buildings.X or other Y Bruises, wounds, broken bones or other

injuries...Y such as X The bow lute, such as the Bambara ndang...

Such Y as X ...such authors as Herrick, Goldsmith, and Shakespeare.

Y including X ...common-law countries, including Canada and England...

Y , especially X European countries, especially France, England, and Spain...

Page 65: Relation Extraction

Problem with hand-built Problem with hand-built patternspatterns

Requires that we hand-build patterns for Requires that we hand-build patterns for each relation!each relation! don’t want to have to do this for all

possible relations! we’d like better accuracy

Page 66: Relation Extraction

5 easy methods for relation 5 easy methods for relation extractionextraction

1.1. Hand-built patternsHand-built patterns2.2. Supervised methodsSupervised methods3.3. Bootstrapping (seed) methodsBootstrapping (seed) methods

4.4. Unsupervised methodsUnsupervised methods

Page 67: Relation Extraction

2. Supervised Relation 2. Supervised Relation ExtractionExtraction

Sometimes done in 3 stepsSometimes done in 3 steps1. Find all pairs of named entities2. Decide if 2 entities are related3. If yes, classifying the relation

Why the extra step?Why the extra step? Cuts down on training time for classification

by eliminating most pairs Producing separate feature-sets that are

appropriate for each task.

Page 68: Relation Extraction

Relation AnalysisRelation Analysis

Usually just run on named entities within the Usually just run on named entities within the same sentencesame sentence

Slide from Jim Martin

Page 69: Relation Extraction

Relation ExtractionRelation Extraction

Task definition: to label the semantic relation Task definition: to label the semantic relation between a pair of entities in a sentence between a pair of entities in a sentence (fragment)(fragment)

Slide from Jing Jiang

…[leader arg-1] of a minority [government arg-2]…

PHYS PER-SOC EMP-ORG NIL

PHYS: PhysicalPER-SOC: Personal / SocialEMP-ORG: Employment / Membership / Subsidiary

Page 70: Relation Extraction

Supervised LearningSupervised Learning

Supervised machine learning Supervised machine learning ((e.g. [e.g. [Zhou et al. 2005], Zhou et al. 2005], [Bunescu & Mooney 2005], [Zhang et al. 2006[Bunescu & Mooney 2005], [Zhang et al. 2006], [Surdeanu & Ciaramita ], [Surdeanu & Ciaramita 2007]2007]))

Training data is needed for each relation typeTraining data is needed for each relation type

…[leader arg-1] of a minority [government arg-2]…

arg-1 word: leader arg-2 type: ORG

dependency:arg-1 of arg-2

EMP-ORGPHYS PER-SOC NIL

Slide from Jing Jiang

Page 71: Relation Extraction

ACE 2008 Six RelationsACE 2008 Six Relations

Page 72: Relation Extraction

Features: WordsFeatures: Words

Headwords of M1 and M2, and combinationHeadwords of M1 and M2, and combination• George Washington Bridge

Bag of words and bigrams in M1 and M2Bag of words and bigrams in M1 and M2 Words or bigrams in particular positions to the Words or bigrams in particular positions to the

left and right of the M1 and M2left and right of the M1 and M2• +/- 1, 2, 3

Bag of words or bigrams between the two entitiesBag of words or bigrams between the two entities

Page 73: Relation Extraction

Features: Named Entity Type Features: Named Entity Type and Mention Leveland Mention Level Named-entity types (ORG, LOC, etc)Named-entity types (ORG, LOC, etc) Concatenation of the typesConcatenation of the types Entity Level of M1 and M2 Entity Level of M1 and M2

(NAME, NOMINAL, PRONOUN)

Page 74: Relation Extraction

Features: Parse Tree and Features: Parse Tree and Base PhrasesBase Phrases

Syntactic environmentSyntactic environment Constituent path through the tree from

one to the other Base syntactic chunk sequence from one

to the other Dependency path

Slide from Jim Martin

Page 75: Relation Extraction

Features: Gazeteers and Features: Gazeteers and trigger wordstrigger words

Personal relative trigger listPersonal relative trigger list from wordnet: parent, wife, husabnd, grandparent,

etc

Country name listCountry name list

Page 76: Relation Extraction

American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.

Page 77: Relation Extraction

Classifiers for supervised Classifiers for supervised methodsmethods

Now you can use any classifier you likeNow you can use any classifier you like SVM Logistic regression Naïve Bayes etc

Page 78: Relation Extraction

SummarySummary

Can get high accuracies with enough hand-Can get high accuracies with enough hand-labeled training data labeled training data

If test data looks exactly like the training dataIf test data looks exactly like the training data ButBut

labeling 5000 relations (and named entities) is expensive

the approach doesn’t generalize to different genres

Page 79: Relation Extraction

5 easy methods for relation 5 easy methods for relation extractionextraction

1.1. Hand-built patternsHand-built patterns2.2. Supervised methodsSupervised methods3.3. Bootstrapping (seed) Bootstrapping (seed)

methodsmethods

4.4. Unsupervised methodsUnsupervised methods5.5. Distant supervisionDistant supervision

Page 80: Relation Extraction

Bootstrapping ApproachesBootstrapping Approaches

What if you don’t have enough annotated text to train on.What if you don’t have enough annotated text to train on. But you might have some seed tuples Or you might have some patterns that work pretty well

Can you use those seeds to do something useful?Can you use those seeds to do something useful? Co-training and active learning use the seeds to train

classifiers to tag more data to train better classifiers... Bootstrapping tries to learn directly (populate a relation)

through direct use of the seeds

Slide from Jim Martin

Page 81: Relation Extraction

Bootstrapping Example: Bootstrapping Example: Seed TupleSeed Tuple

<Mark Twain, Elmira> <Mark Twain, Elmira> Seed tupleSeed tuple Grep (google) “Mark Twain is buried in Elmira, NY.”

• X is buried in Y

“The grave of Mark Twain is in Elmira”• The grave of X is in Y

“Elmira is Mark Twain’s final resting place”• Y is X’s final resting place.

Use those patterns to grep for new tuples that you Use those patterns to grep for new tuples that you don’t already knowdon’t already know

Slide from Jim Martin

Page 82: Relation Extraction

Hearst (1992) proposal for Hearst (1992) proposal for bootstrappingbootstrapping

Choose lexical relation RChoose lexical relation R Gather a set of pairs that have this relationGather a set of pairs that have this relation Find places in the corpus where these Find places in the corpus where these

expressions occur near each other and expressions occur near each other and record the environmentrecord the environment

Find the commonalities among these Find the commonalities among these environments and hypothesize that common environments and hypothesize that common ones yield patterns that indicate the relation ones yield patterns that indicate the relation of interestof interest

Page 83: Relation Extraction

Bootstrapping RelationsBootstrapping Relations

Slide from Jim Martin

Page 84: Relation Extraction

Dipre (Brin 1998)Dipre (Brin 1998)

Extract <author, book> pairsExtract <author, book> pairs Start with these 5 seedsStart with these 5 seeds

Learn these patterns:Learn these patterns:

Now iterate, using these patterns to get more Now iterate, using these patterns to get more instances and patterns…instances and patterns…

Page 85: Relation Extraction

Snowball Snowball [Agichtein & [Agichtein & Gravano 2000]Gravano 2000] Exploit duality between patterns and Exploit duality between patterns and

tuplestuples

- find tuples that match a set of patterns- find patterns that match a lot of tuples

bootstrapping approach

Initial Seed Tuples Occurrences of Seed Tuples

Tag Entities

Generate Extraction Patterns

Generate New Seed Tuples

Augment Table

Page 86: Relation Extraction

SnowballSnowball

ORGANIZATION LOCATIONMICROSOFT REDMONDIBM ARMONKBOEING SEATTLEINTEL SANTA CLARA

initial seed tuples

Computer servers at Microsoft’s headquarters in Redmond…

In mid-afternoon trading, share ofRedmond-based Microsoft fell…

The Armonk-based IBM introduceda new line…

The combined company will operate

from Boeing’s headquarters in Seattle.

Intel, Santa Clara, cut prices of itsPentium processor.

occurrences of seed tuples

Page 87: Relation Extraction

Snowball Snowball

Require that X and Y be named entities of particular typesRequire that X and Y be named entities of particular types

{<’s 0.7> <headquarters 0.7> <in 0.7> }ORGANIZATION LOCATIO

N

{<- 0.75> <based 0.75>}

ORGANIZATIONLOCATION

Page 88: Relation Extraction

PatternsPatterns

(extraction)(extraction) patternpattern has format <left, tag1, has format <left, tag1, middle, tag2, right>, middle, tag2, right>,

where tag1, tag2 are named-entity tags and left, middle, and right are vectors of weighted terms

•patterns derived directly from occurrences patterns derived directly from occurrences are too specificare too specific

< left , tag1 , middle , tag2 , right >

ORGANIZATION 's central headquarters in LOCATION is home to...

LOCATIONORGANIZATION{<'s 0.5>, <central 0.5> <headquarters 0.5>, < in 0.5>}

{<is 0.75>, <home 0.75> }

Page 89: Relation Extraction

Pattern ClustersPattern Clusters

cluster patterns, cluster centroids define patternscluster patterns, cluster centroids define patterns

ORGANIZATION

{<servers 0.75><at 0.75>}

{<’s 0.5> <central 0.5> <headquarters 0.5> <in 0.5>}

LOCATION

ORGANIZATION

{<operate 0.75><from 0.75>}

{<’s 0.7> <headquarters 0.7> <in 0.7>}

LOCATION

ORGANIZATION

Cluster 1

{<shares 0.75><of 0.75>}

{<- 0.75> <based 0.75> }

{<fell 1>}

{<the 1>}

{<- 0.75> <based 0.75> }

{<introduced 0.75> <a 0.75>}LOCATION

ORGANIZATION

ORGANIZATION

Cluster 2

LOCATION

Page 90: Relation Extraction

5 easy methods for 5 easy methods for relation extractionrelation extraction

1.1. Hand-built patternsHand-built patterns2.2. Supervised methodsSupervised methods3.3. Bootstrapping (seed) methodsBootstrapping (seed) methods

4.4. Unsupervised methodsUnsupervised methods5.5. Distant supervisionDistant supervision

Page 91: Relation Extraction

Distant supervision Distant supervision paradigmparadigm

Instead of hand-creating 5 seed examplesInstead of hand-creating 5 seed examples Use a large database to get our seed Use a large database to get our seed

examplesexamples lots of examples supervision from a database, not a corpus!

• Not genre-dependent!

Create lots and lots of noisy features from all these examples

Combine in a classifier

Snow, Jurafsky, Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. NIPS 17

Mintz, Bills, Snow, Jurafsky (2009) Distant supervision for relation extraction without labeled data. ACL-2009.

Page 92: Relation Extraction

Distant supervision Distant supervision paradigmparadigm

Has advantages of supervised classification:Has advantages of supervised classification: use of rich of hand-created knowledge

Has advantages of unsupervised Has advantages of unsupervised classification:classification: infinite amounts of data allows for very large number of weak

features not sensitive to training corpus

Page 93: Relation Extraction

Relation Classification with Relation Classification with “Distant Supervision”“Distant Supervision”

We construct a noisy training set consisting of occurrences from our corpus that contain an IS-A pair according to WordNet.

Slide from Rion Snow

Suppose scientists could erase memories with a single substance in the brain.

Page 94: Relation Extraction

Relation Classification with Relation Classification with “Distant Supervision”“Distant Supervision”

This leads to high-signal examples like:

“...consider authors like Shakespeare...”

“Shakespeare, author of The Tempest...”

“Some authors (including Shakespeare)...”

“Shakespeare was the author of several...”

Construct a noisy training set consisting of occurrences from our corpus that contain an IS-A pair according to WordNet.

Slide from Rion Snow

Page 95: Relation Extraction

Relation Classification with Relation Classification with “Distant Supervision”“Distant Supervision”

This leads to high-signal examples like:

But noisy examples like:

“...consider authors like Shakespeare...”

“Shakespeare, author of The Tempest...”

“Some authors (including Shakespeare)...”

“Shakespeare was the author of several...”

“The author of Shakespeare in Love...”

“...authors at the Shakespeare Festival...”

Training set (TREC and Wikipedia):14,000 hypernym pairs, ~600,000 total pairs

Slide from Rion Snow

Page 96: Relation Extraction

How to learn patternsHow to learn patterns

Take corpus sentencesTake corpus sentences

Collect noun pairsCollect noun pairs752,311 pairs from 6M words of 752,311 pairs from 6M words of

newswirenewswire

Is pair an IS-A in WordNet? Is pair an IS-A in WordNet? 14,387 yes, 737,924 no

Parse the sentencesParse the sentences

Extract patternsExtract patterns69,592 dependency paths >5 69,592 dependency paths >5

pairs)pairs)

Train classifier on these Train classifier on these patternspatterns

Logistic regressionLogistic regression with 70K with 70K featuresfeatures(actually converted to (actually converted to 974,288 bucketed binary 974,288 bucketed binary features)features)

… doubly heavy hydrogen atom called deuterium…

44

11

22

33

55

66

(Atom, deuterium)

YES

Snow, Jurafsky, Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. NIPS 17

Page 97: Relation Extraction

One of 70,000 patternsOne of 70,000 patterns

“<superordinate> ‘called’ <subordinate>”

Learned from cases such as:

“sarcoma / cancer”: …an uncommon bone cancer called osteogenic sarcoma and to…“deuterium / atom” ….heavy water rich in the doubly heavy hydrogen atom called deuterium.

New pairs discovered:

“efflorescence / condition”: …and a condition called efflorescence are other reasons for… “’neal_inc / company” …The company, now called O'Neal Inc., was sole distributor of E-Ferol…“hat_creek_outfit / ranch” …run a small ranch called the Hat Creek Outfit.“hiv-1 / aids_virus” …infected by the AIDS virus, called HIV-1.“bateau_mouche / attraction” …local sightseeing attraction called the Bateau Mouche...“kibbutz_malkiyya / collective_farm” …an Israeli collective farm called Kibbutz Malkiyya…

Page 98: Relation Extraction

Hypernym Precision / Recall for all Hypernym Precision / Recall for all FeaturesFeatures

Slide from Rion Snow

Page 99: Relation Extraction

logistic regression

Idea: use each pattern as a Idea: use each pattern as a feature!!!!feature!!!!Precision/Recall for Hypernym Precision/Recall for Hypernym Classification:Classification:

10-fold Cross Validation on 14,000 WordNet-Labeled Pairs

Slide from Rion Snow

Page 100: Relation Extraction

Extracting more relations with Extracting more relations with distant supervisiondistant supervision

2700 relations > 10 instances

5.2 million instances3.7 million entities

Training SetCorpus

2.3 million articles31.5 million sentences

Mintz, Bills, Snow, Jurafsky (2009) Distant supervision for relationextraction without labeled data. ACL-2009.

Page 101: Relation Extraction

Frequent Freebase Frequent Freebase RelationsRelations

aa

Page 102: Relation Extraction

Algorithm: Distant Algorithm: Distant SupervisionSupervision

A kind of A kind of weakly supervised learningweakly supervised learning Use a large database to get seed

examples Create lots and lots of noisy pattern

features from all these examples Combine in a classifier

Page 103: Relation Extraction

Extract parse and other Extract parse and other featuresfeatures

“Astronomer Edward Hubble was born in Marshfield, Missouri”

“Named entities”

Edwin Hubble is a PERSON

Marshfield is a LOCATION Lexical items nearby…

Page 104: Relation Extraction

New relations learnedNew relations learned

Montmartre Montmartre IS-IN IS-IN ParisParis Fort Erie Fort Erie IS-IN IS-IN OntarioOntario Fyoder Kamesnky Fyoder Kamesnky DIED-IN DIED-IN ClearwaterClearwater Utpon Sinclair Utpon Sinclair WROTE WROTE Lanny BuddLanny Budd Vince McMahon Vince McMahon FOUNDED FOUNDED WWEWWE Thomas Mellon Thomas Mellon HAS-PROFESSION HAS-PROFESSION JudgeJudge

Page 105: Relation Extraction

Human evaluation: precision Human evaluation: precision using Mechanical Turk using Mechanical Turk labelerslabelersFeatureFeature PrecisionPrecision

SyntacticSyntactic .67.67

Lexical Lexical .66.66

BothBoth .69.69

Page 106: Relation Extraction

Where syntactic Where syntactic knowledge helpsknowledge helps

Back Street Back Street is a 1932 film made by Universal is a 1932 film made by Universal Pictures, Pictures, directed directed by by John M. StahlJohn M. Stahl, and , and produced by Carl Laemmle Jr.produced by Carl Laemmle Jr.

Back Street Back Street and and John M. Stahl John M. Stahl are very far are very far apart in surface stringapart in surface string

But are close together in dependency parseBut are close together in dependency parse

Page 107: Relation Extraction

Unsupervised relation Unsupervised relation extractionextraction Banko et al 2007 “Open information extraction Banko et al 2007 “Open information extraction

from the Web”from the Web” Extracting relations from the web withExtracting relations from the web with

no training data no predetermined list of relations

The Open ApproachThe Open Approach1.Use parse data to train a “trust-worthy” classifier

2.Extract trustworthy relations among NPs

3.Rank relations based on text redundancy