Lecture 14: Relation Extraction (CSCE 771 Natural Language Processing, March 4, 2013)


Page 1: Lecture 14  Relation Extraction

Lecture 14 Relation Extraction

Topics: Relation Extraction

Readings: Chapter 22; NLTK 7.4-7.5

March 4, 2013

CSCE 771 Natural Language Processing

Page 2: Lecture 14  Relation Extraction

Overview

Last Time
• NER in NLTK
• Chunking: Example 7.4 (code_chunker1.py); chinking: Example 7.5 (code_chinker.py)
• Evaluation: Example 7.8 (code_unigram_chunker.py), Example 7.9 (code_classifier_chunker.py)

Today
• Relation extraction
• ACE; Freebase, DBpedia
• Ontological relations
• Rules for IS-A extraction
• Supervised relation extraction
• Relation bootstrapping
• Unsupervised relation extraction
• NLTK 7.5 Named Entity Recognition

Readings: NLTK Ch 7.4-7.5

Page 3: Lecture 14  Relation Extraction

Dear Dr. Mathews,

I have the following questions:

1. (c) Do you need the regular expression that will capture the link inside href="..."?

    (d) What kind of description do you want? It is a Python function with no arguments. Do you want an answer like that?

3. (f-g) Do you mean the top 100 in terms of count?

4. (e-f) You did not show how to use NLTK for HMM and Brill tagging. Can you please give an example?

-Thanks

Page 4: Lecture 14  Relation Extraction

Relation Extraction

What is relation extraction?

"Founded in 1801 as South Carolina College, USC is the flagship institution of the University of South Carolina System and offers more than 350 programs of study leading to bachelor's, master's, and doctoral degrees from fourteen degree-granting colleges and schools to an enrollment of approximately 45,251 students, 30,967 on the main Columbia campus. …" [wiki]

complex relation = summarization

focus on binary relations: predicate(subject, object) or triples <subj predicate obj>

Page 5: Lecture 14  Relation Extraction

Wiki Info Box – structured data

template: standard things about universities
• Established
• type
• faculty
• students
• location
• mascot
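As a rough illustration of why an infobox counts as structured data, the same template can be written as attribute-value pairs; the values below are illustrative, not taken from the actual infobox:

# Hypothetical infobox record; field names follow the template above, values are illustrative.
usc_infobox = {
    "established": 1801,
    "type": "Public",
    "faculty": 2000,
    "students": 45251,
    "location": "Columbia, South Carolina",
    "mascot": "Cocky",
}
print(usc_infobox["established"])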

Page 6: Lecture 14  Relation Extraction

Focus on extracting binary relations
• predicate(subject, object), from predicate logic
• triples <subj relation object>
• directed graphs
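A minimal sketch (entities and relation names are illustrative) of storing such binary relations as triples and reading them back as a labeled directed graph:

# Each relation instance is a (subject, predicate, object) triple.
triples = [
    ("USC", "founded-in", "1801"),
    ("USC", "located-in", "Columbia"),
    ("Columbia", "located-in", "South Carolina"),
]

# The same triples as a labeled directed graph: node -> list of (edge label, node) pairs.
graph = {}
for subj, pred, obj in triples:
    graph.setdefault(subj, []).append((pred, obj))

print(graph["USC"])   # [('founded-in', '1801'), ('located-in', 'Columbia')]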

Page 7: Lecture 14  Relation Extraction

Why relation extraction?
• create new structured knowledge bases (KBs)
• augment existing ones: words -> WordNet, facts -> Freebase or DBpedia
• support question answering: Jeopardy

Which relations?
• Automated Content Extraction (ACE)  http://www.itl.nist.gov/iad/mig//tests/ace/
• 17 relations
• ACE examples

Page 8: Lecture 14  Relation Extraction

Unified Medical Language System (UMLS)

UMLS: 134 entities, 54 relations
http://www.nlm.nih.gov/research/umls/

Page 9: Lecture 14  Relation Extraction


UMLS semantic network

Page 10: Lecture 14  Relation Extraction

Current Relations in the UMLS Semantic Network

isa
associated_with
    physically_related_to
        part_of
        consists_of
        contains
        connected_to
        interconnects
        branch_of
        tributary_of
        ingredient_of
    spatially_related_to
        location_of
        adjacent_to
        surrounds
        traverses
    functionally_related_to
        affects
        …
    temporally_related_to
        co-occurs_with
        precedes
    conceptually_related_to
        evaluation_of
        degree_of
        analyzes
            assesses_effect_of
        measurement_of
        measures
        diagnoses
        property_of
        derivative_of
        developmental_form_of
        method_of
        …

Page 11: Lecture 14  Relation Extraction

Databases of Wikipedia Relations
• DBpedia is a crowd-sourced community effort
• to extract structured information from Wikipedia
• and to make this information readily available
• DBpedia allows you to make sophisticated queries

http://dbpedia.org/About

Page 12: Lecture 14  Relation Extraction

English version of the DBpedia knowledge base
• 3.77 million things
• 2.35 million are classified in an ontology, including:
  • 764,000 persons
  • 573,000 places (including 387,000 populated places)
  • 333,000 creative works (including 112,000 music albums, 72,000 films and 18,000 video games)
  • 192,000 organizations (including 45,000 companies and 42,000 educational institutions)
  • 202,000 species
  • 5,500 diseases

Page 13: Lecture 14  Relation Extraction

Freebase

Google (Freebase wiki)
http://wiki.freebase.com/wiki/Main_Page

Page 14: Lecture 14  Relation Extraction

Ontological relations
• IS-A (hypernym)
• Instance-of
• has-Part
• hyponym (opposite of hypernym)

Page 15: Lecture 14  Relation Extraction


How to build extractors

Page 16: Lecture 14  Relation Extraction

Extracting the IS-A relation

(Hearst 1992) Automatic acquisition of hyponyms

"Naproxen sodium is a nonsteroidal anti-inflammatory drug (NSAID)." [wiki]

Page 17: Lecture 14  Relation Extraction

Hearst's Patterns for IS-A extraction

Patterns for <X IS-A Y>:

"Y such as X"
"X or other Y"
"Y including X"
"Y, especially X"
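A minimal sketch of applying the first pattern with a plain regular expression (the regex, the variable names, and the test sentence are illustrative; real systems match over NP chunks or parse trees rather than raw text):

import re

# Toy matcher for the "Y such as X" pattern: Y = one or two words, X = one word.
ISA_SUCH_AS = re.compile(r"((?:\w+\s)?\w+)\s*,?\s*such as\s+(\w+)")

text = "She grows common herbs such as basil and oregano"
m = ISA_SUCH_AS.search(text)
if m:
    hypernym, hyponym = m.group(1), m.group(2)
    print("IS-A(%s, %s)" % (hyponym, hypernym))   # IS-A(basil, common herbs)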

Page 18: Lecture 14  Relation Extraction

Extracting Richer Relations Using Specific Rules

Intuition: relations that commonly hold: located-in, cures, owns

What relations hold between two entities?

Page 19: Lecture 14  Relation Extraction


Fig 22.16 Pattern and Bootstrapping

Page 20: Lecture 14  Relation Extraction

Hand-built patterns for relations

Pros

Cons

Page 21: Lecture 14  Relation Extraction

Supervised Relation Extraction

How to do classification in supervised relation extraction:

1. find all pairs of named entities
2. decide whether they are related
3. if they are, classify the relation (see the sketch below)
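A minimal sketch of this three-step pipeline; the entities, the two stand-in classifiers, and the relation labels are all illustrative placeholders for trained models:

from itertools import combinations

# Toy input: named-entity mentions from one sentence (in practice produced by an NER system).
entities = [("American Airlines", "ORG"), ("Tim Wagner", "PER"), ("Dallas", "LOC")]

def related(e1, e2):
    # Stand-in for step 2: a trained binary classifier deciding "related or not".
    return {e1[1], e2[1]} in ({"ORG", "PER"}, {"ORG", "LOC"})

def relation_type(e1, e2):
    # Stand-in for step 3: a trained classifier labeling the relation.
    types = {e1[1], e2[1]}
    if types == {"ORG", "PER"}:
        return "employs"
    if types == {"ORG", "LOC"}:
        return "located-in"

# Step 1: generate all pairs of named entities; steps 2-3: filter and label them.
for e1, e2 in combinations(entities, 2):
    if related(e1, e2):
        print("%s(%s, %s)" % (relation_type(e1, e2), e1[0], e2[0]))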

Page 22: Lecture 14  Relation Extraction

ACE - Automated Content Extraction
• http://projects.ldc.upenn.edu/ace/
• Linguistic Data Consortium
• Entity Detection and Tracking (EDT)
• Relation Detection and Characterization (RDC)
• Event Detection and Characterization (EDC)
• 6 classes of relations, 17 overall

Page 23: Lecture 14  Relation Extraction

Word Features for Relation Extraction

• headwords of M1 and M2
• named entity types
• mention-level features: name, pronoun, nominal

Page 24: Lecture 14  Relation Extraction

Parse Features for Relation Extraction

• base syntactic chunk sequence from one mention to the other
• constituent path
• dependency path

(A feature-extraction sketch follows below.)
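A minimal sketch of turning one mention pair into a feature dictionary; the sentence, the span indices, and the feature names are illustrative, and the constituent/dependency path features are omitted because they require a parser:

def relation_features(tokens, m1, m2):
    # m1, m2 are (start, end, ne_type) spans over `tokens`.
    s1, e1, t1 = m1
    s2, e2, t2 = m2
    between = tokens[e1:s2]
    return {
        "headword_m1": tokens[e1 - 1],        # crude headword: last token of the mention
        "headword_m2": tokens[e2 - 1],
        "ne_types": t1 + "-" + t2,            # named-entity type pair
        "words_between": " ".join(between),
        "num_words_between": len(between),
    }

tokens = ("American Airlines , a unit of AMR , immediately matched "
          "the move , spokesman Tim Wagner said").split()
m1 = (0, 2, "ORG")     # American Airlines
m2 = (14, 16, "PER")   # Tim Wagner
print(relation_features(tokens, m1, m2))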

Page 25: Lecture 14  Relation Extraction

Gazetteer and Trigger-Word Features for Relation Extraction

• trigger list for kinship relations
• gazetteer: name lists

Page 26: Lecture 14  Relation Extraction

Evaluation of Supervised Relation Extraction
• P/R/F (precision, recall, F-measure)

Summary
+ high accuracies
- needs a labeled training set
- models are brittle
- don't generalize well

Page 27: Lecture 14  Relation Extraction

Semi-Supervised Relation Extraction

Seed-based or bootstrapping approaches to RE

No training set. Can you still do anything?

Bootstrapping

Page 28: Lecture 14  Relation Extraction

Relation Bootstrapping (Hearst 1992)

Gather seed pairs of relation R, then iterate:

1. find sentences containing the seed pairs
2. look at the context around the pair and generalize it into patterns
3. use the patterns to search for more pairs

(A sketch of the loop follows below.)
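A minimal sketch of the loop; the tiny corpus, the seed pair, and the string-based "patterns" are purely illustrative, and a real system would generalize contexts and score pattern reliability much more carefully:

corpus = [
    "Mark Twain is buried in Elmira , NY .",
    "Poe is buried in Baltimore , MD .",
    "Dickens was born in Portsmouth .",
]

seeds = {("Mark Twain", "Elmira")}     # seed pairs for relation R (buried-in)
pairs = set(seeds)
patterns = set()

for _ in range(2):                     # a couple of bootstrapping iterations
    # 1-2. find sentences containing a known pair; the text between the pair becomes a pattern
    for sent in corpus:
        for x, y in list(pairs):
            if x in sent and y in sent:
                middle = sent.split(x, 1)[1].split(y, 1)[0]
                if middle.strip():
                    patterns.add(middle)
    # 3. use the patterns to search for more pairs (very naive string matching)
    for sent in corpus:
        for p in patterns:
            if p in sent:
                left, _, right = sent.partition(p)
                x = " ".join(left.split()[-2:])     # crude "entity": last two tokens before the pattern
                y = right.split(",")[0].strip()     # crude "entity": text up to the next comma
                pairs.add((x, y))

print(patterns)   # {' is buried in '}
print(pairs)      # seed pair plus ('Poe', 'Baltimore')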

Page 29: Lecture 14  Relation Extraction


Bootstrapping Example

Page 30: Lecture 14  Relation Extraction

Extract <author, book> pairs

DIPRE: start with seeds

Find instances

Extract patterns

Now iterate

Page 31: Lecture 14  Relation Extraction

Snowball Algorithm (Agichtein and Gravano 2000)

Distant supervision

The distant supervision paradigm: like supervised classification, but the training examples come from aligning text with facts in an existing knowledge base rather than from hand labeling

Page 32: Lecture 14  Relation Extraction

Unsupervised relation extraction

Banko et al. 2007, "Open information extraction from the Web"

Extracting relations from the web with
• no training data
• no predetermined list of relations

The Open approach:
1. use parsed data to train a "trust-worthy" classifier
2. extract trustworthy relations among NPs
3. rank relations based on text redundancy (a toy ranking sketch follows below)
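A toy sketch of step 3 only: rank extracted triples by how many supporting extractions they have; the triples themselves are made-up placeholders for the output of steps 1-2:

from collections import Counter

# Pretend output of steps 1-2: one extracted triple per supporting sentence.
extractions = [
    ("Einstein", "was born in", "Ulm"),
    ("Einstein", "was born in", "Ulm"),
    ("Einstein", "was born in", "Germany"),
    ("Turing", "was born in", "London"),
]

# Redundancy = number of supporting extractions per distinct triple.
for triple, support in Counter(extractions).most_common():
    print(support, triple)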

Page 33: Lecture 14  Relation Extraction

Evaluation of Semi-supervised and Unsupervised RE

No gold standard ... the web is not tagged
• no way to compute precision or recall directly

Instead, only estimate precision
• draw a sample and check its precision manually
• alternatively, choose several levels of recall and check the precision there

No way to check recall?
• randomly select a text sample and manually check it
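For the sampled check, the estimated precision is simply the fraction of hand-checked extractions that turn out to be correct:

\hat{P} = \frac{\text{\# correct relations in the checked sample}}{\text{\# relations drawn in the sample}}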

Page 34: Lecture 14  Relation Extraction

NLTK Information Extraction

Page 35: Lecture 14  Relation Extraction

NLTK Review: NLTK 7.1-7.3

Chunking: Example 7.4 (code_chunker1.py); chinking: Example 7.5 (code_chinker.py); simple regexp chunker; evaluation: Example 7.8 (code_unigram_chunker.py), Example 7.9 (code_classifier_chunker.py)

Page 36: Lecture 14  Relation Extraction

Review 7.4: Simple Noun Phrase Chunker

import nltk

grammar = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and nouns
      {<NNP>+}                # chunk sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)
sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"), ("her", "PP$"),
            ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]
print(cp.parse(sentence))

Page 37: Lecture 14  Relation Extraction

(S
  (NP Rapunzel/NNP)
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))

Page 38: Lecture 14  Relation Extraction

Review 7.5: Simple Noun Phrase Chinker

grammar = r"""
  NP:
    {<.*>+}          # Chunk everything
    }<VBD|IN>+{      # Chink sequences of VBD and IN
"""
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
cp = nltk.RegexpParser(grammar)
print(cp.parse(sentence))

Page 39: Lecture 14  Relation Extraction

>>>
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))
>>>

Page 40: Lecture 14  Relation Extraction

RegExp Chunker – conll2000

import nltk
from nltk.corpus import conll2000

cp = nltk.RegexpParser("")
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
print(cp.evaluate(test_sents))

grammar = r"NP: {<[CDJNP].*>+}"
cp = nltk.RegexpParser(grammar)
print(cp.evaluate(test_sents))

Page 41: Lecture 14  Relation Extraction

ChunkParse score:
    IOB Accuracy:  43.4%
    Precision:      0.0%
    Recall:         0.0%
    F-Measure:      0.0%

ChunkParse score:
    IOB Accuracy:  87.7%
    Precision:     70.6%
    Recall:        67.8%
    F-Measure:     69.2%

Page 42: Lecture 14  Relation Extraction

Conference on Computational Natural Language Learning (CoNLL-2000)

http://www.cnts.ua.ac.be/conll2000/chunking/

CoNLL 2013: Seventeenth Conference on Computational Natural Language Learning

Page 43: Lecture 14  Relation Extraction

Evaluation Example 7.8 (code_unigram_chunker.py)

AttributeError: 'module' object has no attribute 'conlltags2tree'
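This usually means the installed NLTK version exposes the helper under a different path than the one the example calls; one possible workaround, assuming a version where the function is defined in nltk.chunk.util, is to call it there (the tag triples below are only a quick smoke test):

import nltk.chunk.util

conlltags = [("the", "DT", "B-NP"), ("cat", "NN", "I-NP"), ("sat", "VBD", "O")]
print(nltk.chunk.util.conlltags2tree(conlltags))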

Page 44: Lecture 14  Relation Extraction

code_classifier_chunker.py

NLTK was unable to find the megam file!
Use software specific configuration parameters or set the MEGAM environment variable.

For more information on megam, see:
<http://www.cs.utah.edu/~hal/megam/>

Page 45: Lecture 14  Relation Extraction


7.4   Recursion in Linguistic Structure

Page 46: Lecture 14  Relation Extraction

code_cascaded_chunker

grammar = r"""
  NP: {<DT|JJ|NN.*>+}           # Chunk sequences of DT, JJ, NN
  PP: {<IN><NP>}                # Chunk prepositions followed by NP
  VP: {<VB.*><NP|PP|CLAUSE>+$}  # Chunk verbs and their arguments
  CLAUSE: {<NP><VP>}            # Chunk NP, VP
"""
cp = nltk.RegexpParser(grammar)
sentence = [("Mary", "NN"), ("saw", "VBD"), ("the", "DT"), ("cat", "NN"),
            ("sit", "VB"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]
print(cp.parse(sentence))

Page 47: Lecture 14  Relation Extraction

>>>
(S
  (NP Mary/NN)
  saw/VBD
  (CLAUSE
    (NP the/DT cat/NN)
    (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))

Page 48: Lecture 14  Relation Extraction

A sentence having deeper nesting

sentence = [("John", "NNP"), ("thinks", "VBZ"), ("Mary", "NN"),
            ("saw", "VBD"), ("the", "DT"), ("cat", "NN"), ("sit", "VB"),
            ("on", "IN"), ("the", "DT"), ("mat", "NN")]
print(cp.parse(sentence))

(S
  (NP John/NNP)
  thinks/VBZ
  (NP Mary/NN)
  saw/VBD
  (CLAUSE
    (NP the/DT cat/NN)
    (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))

Page 49: Lecture 14  Relation Extraction

Trees

print(tree4[1])
(VP chased (NP the rabbit))

tree4[1].node
'VP'

tree4.leaves()
['Alice', 'chased', 'the', 'rabbit']

tree4[1][1][1]
'rabbit'

tree4.draw()

Page 50: Lecture 14  Relation Extraction

Trees - code_traverse.py

import nltk

def traverse(t):
    try:
        t.node
    except AttributeError:
        print t,
    else:
        # Now we know that t.node is defined
        print '(', t.node,
        for child in t:
            traverse(child)
        print ')',

t = nltk.Tree('(S (NP Alice) (VP chased (NP the rabbit)))')
traverse(t)

Page 51: Lecture 14  Relation Extraction

NLTK 7.5 Named Entity Recognition

sent = nltk.corpus.treebank.tagged_sents()[22]
print(nltk.ne_chunk(sent, binary=True))
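Moving from NER to relation extraction, the NLTK book (section 7.6) also provides a simple pattern-based relation extractor; a sketch along the lines of the book's ORG-in-LOC example, assuming the ieer corpus has been downloaded:

import re
import nltk

# Relations of the form ORG "in" LOC, skipping gerunds like "including".
IN = re.compile(r'.*\bin\b(?!\b.+ing)')

for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
        print(nltk.sem.rtuple(rel))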

Page 52: Lecture 14  Relation Extraction
