Lecture 14 Relation Extraction
Topics: Relation Extraction
Readings: Chapter 22; NLTK 7.4-7.5
March 4, 2013
CSCE 771 Natural Language Processing
– 2 – CSCE 771 Spring 2013
Overview
Last Time:
NER in NLTK
Chunking Example 7.4 (code_chunker1.py), chinking Example 7.5 (code_chinker.py), Evaluation Example 7.8 (code_unigram_chunker.py), Example 7.9 (code_classifier_chunker.py)
Today:
Relation extraction; ACE; Freebase, DBpedia; Ontological relations; Rules for IS-A extraction; Supervised relation extraction; Relation bootstrapping; Unsupervised relation extraction; NLTK 7.5 Named Entity Recognition
Readings: NLTK Ch 7.4-7.5
Dear Dr. Mathews,
I have the following questions:
1. (c) Do you need the regular expression that will capture the link inside href="..."?
(d) What kind of description do you want? It is a Python function with no argument. Do you want an answer like that?
3. (f-g) Do you mean top 100 in terms of count?
4.(e-f) You did not show how to use nltk for HMM and Brill tagging. Can you please give an example?
-Thanks
Relation Extraction
What is relation extraction?
Founded in 1801 as South Carolina College, USC is the flagship institution of the University of South Carolina System and offers more than 350 programs of study leading to bachelor's, master's, and doctoral degrees from fourteen degree-granting colleges and schools to an enrollment of approximately 45,251 students, 30,967 on the main Columbia campus. … [wiki]
complex relation = summarization
focus on binary relations: predicate(subject, object), or triples <subj predicate obj>
Wiki Info Box – structured data
template: standard things about universities
• Established
• type
• faculty
• students
• location
• mascot
Focus on extracting binary relations
• predicate(subject, object) from predicate logic
• triples <subj relation object>
• Directed graphs
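The triple view maps directly onto simple data structures; a minimal sketch, where the entities and relation names are made-up illustrations, not from ACE:

```python
from collections import defaultdict

# Binary relations as <subject, predicate, object> triples
triples = [
    ("USC", "instance-of", "University"),
    ("USC", "located-in", "Columbia"),
    ("Columbia", "located-in", "South Carolina"),
]

# The same triples viewed as a labeled directed graph:
# each subject node gets outgoing (predicate, object) edges
graph = defaultdict(list)
for subj, pred, obj in triples:
    graph[subj].append((pred, obj))

print(graph["USC"])  # → [('instance-of', 'University'), ('located-in', 'Columbia')]
```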
Why relation extraction?
• create new structured KBs
• augment existing ones: words -> WordNet, facts -> Freebase or DBpedia
• support question answering: Jeopardy
Which relations?
Automated Content Extraction (ACE) http://www.itl.nist.gov/iad/mig//tests/ace/
17 relations
ACE examples
Unified Medical Language System (UMLS): 134 entities, 54 relations
http://www.nlm.nih.gov/research/umls/
UMLS semantic network
Current Relations in the UMLS Semantic Network
isa
associated_with
  physically_related_to: part_of, consists_of, contains, connected_to, interconnects, branch_of, tributary_of, ingredient_of
  spatially_related_to: location_of, adjacent_to, surrounds, traverses
  functionally_related_to: affects, …
  temporally_related_to: co-occurs_with, precedes
  conceptually_related_to: evaluation_of, degree_of, analyzes, assesses_effect_of, measurement_of, measures, diagnoses, property_of, derivative_of, developmental_form_of, method_of, …
Databases of Wikipedia Relations
• DBpedia is a crowd-sourced community effort
• to extract structured information from Wikipedia
• and to make this information readily available
• DBpedia allows you to make sophisticated queries
http://dbpedia.org/About
English version of the DBpedia knowledge base
• 3.77 million things
• 2.35 million are classified in an ontology, including:
  • 764,000 persons
  • 573,000 places (including 387,000 populated places)
  • 333,000 creative works (including 112,000 music albums, 72,000 films and 18,000 video games)
  • 192,000 organizations (including 45,000 companies and 42,000 educational institutions)
  • 202,000 species
  • 5,500 diseases
Freebase
Google (Freebase wiki)
http://wiki.freebase.com/wiki/Main_Page
Ontological relations
• IS-A (hypernym)
• Instance-of
• has-Part
• hyponym (opposite of hypernym)
How to build extractors
Extracting the IS-A relation
(Hearst 1992) Automatic Acquisition of Hypernyms
Naproxen sodium is a nonsteroidal anti-inflammatory drug (NSAID). [wiki]
Hearst's Patterns for IS-A extraction
Patterns for <X IS-A Y>:
“Y such as X”
“X or other Y”
“Y including X”
“Y, especially X”
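These lexico-syntactic patterns can be approximated with plain regular expressions over raw text; a minimal sketch, where the pattern set is from the slide but the noun-matching regex is a crude simplification of Hearst's POS-based patterns:

```python
import re

# Each pattern yields one (hyponym X, hypernym Y) pair; "XY"/"YX" records
# which capture group is which.  \w[\w ]* is a stand-in for a real NP chunk.
HEARST_PATTERNS = [
    (r"(\w[\w ]*?) such as (\w[\w ]*)", "YX"),     # "Y such as X"
    (r"(\w[\w ]*?) or other (\w[\w ]*)", "XY"),    # "X or other Y"
    (r"(\w[\w ]*?), including (\w[\w ]*)", "YX"),  # "Y, including X"
    (r"(\w[\w ]*?), especially (\w[\w ]*)", "YX"), # "Y, especially X"
]

def extract_isa(sentence):
    """Return <X IS-A Y> pairs found by the surface patterns."""
    pairs = []
    for pattern, order in HEARST_PATTERNS:
        for m in re.finditer(pattern, sentence):
            x, y = (m.group(2), m.group(1)) if order == "YX" else (m.group(1), m.group(2))
            pairs.append((x.strip(), y.strip()))
    return pairs

print(extract_isa("vehicles such as cars"))  # → [('cars', 'vehicles')]
```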
Extracting Richer Relations Using Specific Rules
Intuition: relations that commonly hold: located-in, cures, owns
What relations hold between two entities?
Fig 22.16 Pattern and Bootstrapping
Hand-built patterns for relations
Pros
Cons
Supervised Relation Extraction
How to do classification in supervised relation extraction:
1. find all pairs of named entities
2. decide if they are related
3. if they are related, classify the relation
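Steps 1 and 2 can be sketched in plain Python; the NE-tagged input format and the `related()` stub below are illustrative assumptions, not NLTK API:

```python
from itertools import combinations

# A sentence's named entities, already recognized: (mention, NE type)
entities = [("USC", "ORG"), ("Columbia", "GPE"), ("South Carolina", "GPE")]

def related(e1, e2):
    """Stub for step 2; a trained binary classifier would go here.
    For illustration: call a pair related iff an ORG meets a GPE."""
    return {e1[1], e2[1]} == {"ORG", "GPE"}

# Step 1: all candidate pairs of named entities
candidates = list(combinations(entities, 2))

# Step 2: keep only the pairs the classifier accepts; step 3 would then
# assign each surviving pair a relation type
related_pairs = [(e1[0], e2[0]) for e1, e2 in candidates if related(e1, e2)]
print(related_pairs)  # → [('USC', 'Columbia'), ('USC', 'South Carolina')]
```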
ACE – Automated Content Extraction
• http://projects.ldc.upenn.edu/ace/
• Linguistic Data Consortium
• Entity Detection and Tracking (EDT)
• Relation Detection and Characterization (RDC)
• Event Detection and Characterization (EDC)
• 6 classes of relations, 17 overall
Word features for relation extraction
• headwords of M1 and M2
• named entity types
• mention-level features: name, pronoun, nominal
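A feature extractor for one candidate mention pair might assemble these cues into a dict; everything here (function name, mention attributes, example values) is an illustrative assumption, not an NLTK or ACE API:

```python
def relation_features(m1, m2, words_between):
    """Build a feature dict for one candidate mention pair.
    Each mention is a dict with 'head', 'ne_type', 'level' keys."""
    return {
        "headword_m1": m1["head"],
        "headword_m2": m2["head"],
        "ne_types": m1["ne_type"] + "-" + m2["ne_type"],  # e.g. ORG-GPE
        "levels": m1["level"] + "-" + m2["level"],        # name/pronoun/nominal
        "words_between": tuple(words_between),
    }

m1 = {"head": "USC", "ne_type": "ORG", "level": "name"}
m2 = {"head": "Columbia", "ne_type": "GPE", "level": "name"}
feats = relation_features(m1, m2, ["is", "located", "in"])
print(feats["ne_types"], feats["levels"])  # → ORG-GPE name-name
```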
Parse features for relation extraction
• base syntactic chunk sequence from one mention to the other
• constituent path
• dependency path
Gazetteer and trigger-word features for relation extraction
• trigger list for kinship relations
• gazetteer: name list
Evaluation of Supervised Relation Extraction
• P/R/F
Summary
+ high accuracies
− requires a labeled training set
− models are brittle
− don't generalize well
Semi-Supervised Relation Extraction
Seed-based or bootstrapping approaches to RE
No training set
Can you … do anything?
Bootstrapping
Relation Bootstrapping (Hearst 1992)
Gather seed pairs of relation R
iterate:
1. find sentences with these pairs
2. look at the context …
3. use the patterns to search for more pairs
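The loop above can be sketched end to end; the corpus, the seed pair, and the exact-match template patterns are toy stand-ins for a real implementation:

```python
def apply_pattern(pattern, sent):
    """Match a 'pre{X}mid{Y}' template (with {Y} last) against a sentence."""
    pre, _, rest = pattern.partition("{X}")
    mid = rest.replace("{Y}", "")
    if sent.startswith(pre) and mid in sent[len(pre):]:
        cut = sent.index(mid, len(pre))
        return sent[len(pre):cut], sent[cut + len(mid):]
    return None

def bootstrap(corpus, seeds, rounds=2):
    """Toy relation bootstrapping: seed pairs -> context patterns -> more pairs."""
    pairs, patterns = set(seeds), set()
    for _ in range(rounds):
        # steps 1-2: find sentences containing a known pair, keep the context
        for sent in corpus:
            for x, y in list(pairs):
                if x in sent and y in sent:
                    patterns.add(sent.replace(x, "{X}").replace(y, "{Y}"))
        # step 3: use the patterns to harvest new pairs
        for sent in corpus:
            for pat in patterns:
                hit = apply_pattern(pat, sent)
                if hit and all(hit):
                    pairs.add(hit)
    return pairs

corpus = ["Mark Twain wrote Tom Sawyer", "Jane Austen wrote Emma"]
print(bootstrap(corpus, {("Mark Twain", "Tom Sawyer")}))
```

A real system would also score each new pattern and pair for reliability before admitting it, which is the key difference between this sketch and systems like Snowball.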
Bootstrapping Example
Extract <author, book> pairs
DIPRE: start with seeds
Find instances
Extract patterns
Now iterate
Snowball Algorithm (Agichtein & Gravano 2000)
Distant supervision
The distant supervision paradigm
Like supervised classification
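The distant-supervision idea, pairing a knowledge base with unlabeled text to auto-generate training data, can be sketched as follows; the KB triples and sentences are invented examples:

```python
# A small "knowledge base" of known relation instances
kb = {
    ("Mark Twain", "Tom Sawyer"): "author-of",
    ("USC", "Columbia"): "located-in",
}

sentences = [
    "Mark Twain wrote Tom Sawyer in 1876",
    "USC sits in downtown Columbia",
    "Tom Sawyer whitewashed a fence",
]

# Distant supervision: any sentence mentioning both arguments of a KB
# pair becomes a (noisily) labeled training example for that relation;
# a supervised classifier is then trained on these examples.
training = []
for (e1, e2), rel in kb.items():
    for sent in sentences:
        if e1 in sent and e2 in sent:
            training.append((sent, rel))

print(len(training))  # → 2
```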
Unsupervised relation extraction
Banko et al. 2007, “Open information extraction from the Web”
Extracting relations from the web with
• no training data
• no predetermined list of relations
The Open Approach:
1. Use parsed data to train a “trustworthy” classifier
2. Extract trustworthy relations among NPs
3. Rank relations based on text redundancy
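Step 3, ranking by redundancy, is essentially frequency counting over extracted triples; a minimal sketch with made-up extractions:

```python
from collections import Counter

# Triples extracted from many web sentences (duplicates = redundancy)
extractions = [
    ("Edison", "invented", "phonograph"),
    ("Edison", "invented", "phonograph"),
    ("Edison", "invented", "phonograph"),
    ("Tesla", "invented", "phonograph"),
]

# Rank candidate relations: the more independent sentences assert a
# triple, the more trustworthy it is
ranked = Counter(extractions).most_common()
print(ranked[0])  # → (('Edison', 'invented', 'phonograph'), 3)
```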
Evaluation of Semi-supervised and Unsupervised RE
No gold standard … the web is not tagged
• no way to compute precision or recall
Instead, only estimate precision:
• draw a sample and check precision manually
• alternatively, choose several levels of recall and check the precision there
No way to check the recall?
• randomly select a text sample and manually check
NLTK Information Extraction
NLTK Review: NLTK 7.1-7.3
Chunking Example 7.4 (code_chunker1.py), chinking Example 7.5 (code_chinker.py), simple re_chunker; Evaluation Example 7.8 (code_unigram_chunker.py), Example 7.9 (code_classifier_chunker.py)
Review 7.4: Simple Noun Phrase Chunker
import nltk

grammar = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and nouns
      {<NNP>+}                # chunk sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)
sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
            ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]
print cp.parse(sentence)
(S
  (NP Rapunzel/NNP)
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))
Review 7.5: Simple Noun Phrase Chinker
grammar = r"""
  NP:
    {<.*>+}        # Chunk everything
    }<VBD|IN>+{    # Chink sequences of VBD and IN
"""
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
cp = nltk.RegexpParser(grammar)
print cp.parse(sentence)
>>>
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))
RegExp Chunker – conll2000
import nltk
from nltk.corpus import conll2000
cp = nltk.RegexpParser("")
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
print cp.evaluate(test_sents)
grammar = r"NP: {<[CDJNP].*>+}"
cp = nltk.RegexpParser(grammar)
print cp.evaluate(test_sents)
ChunkParse score:
  IOB Accuracy: 43.4%
  Precision:     0.0%
  Recall:        0.0%
  F-Measure:     0.0%
ChunkParse score:
  IOB Accuracy: 87.7%
  Precision:    70.6%
  Recall:       67.8%
  F-Measure:    69.2%
Conference on Computational Natural Language Learning (CoNLL-2000)
http://www.cnts.ua.ac.be/conll2000/chunking/
CoNLL 2013: Seventeenth Conference on Computational Natural Language Learning
Evaluation Example 7.8 (code_unigram_chunker.py)
AttributeError: 'module' object has no attribute 'conlltags2tree'
code_classifier_chunker.py
NLTK was unable to find the megam file!
Use software specific configuration parameters or set the MEGAM environment variable.
For more information on megam, see:
<http://www.cs.utah.edu/~hal/megam/>
7.4 Recursion in Linguistic Structure
code_cascaded_chunker
grammar = r"""
  NP: {<DT|JJ|NN.*>+}          # Chunk sequences of DT, JJ, NN
  PP: {<IN><NP>}               # Chunk prepositions followed by NP
  VP: {<VB.*><NP|PP|CLAUSE>+$} # Chunk verbs and their arguments
  CLAUSE: {<NP><VP>}           # Chunk NP, VP
"""
cp = nltk.RegexpParser(grammar)
sentence = [("Mary", "NN"), ("saw", "VBD"), ("the", "DT"), ("cat", "NN"),
            ("sit", "VB"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]
print cp.parse(sentence)
>>>
(S
  (NP Mary/NN)
  saw/VBD
  (CLAUSE
    (NP the/DT cat/NN)
    (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))
A sentence having deeper nesting
sentence = [("John", "NNP"), ("thinks", "VBZ"), ("Mary", "NN"),
            ("saw", "VBD"), ("the", "DT"), ("cat", "NN"), ("sit", "VB"),
            ("on", "IN"), ("the", "DT"), ("mat", "NN")]
print cp.parse(sentence)
(S
  (NP John/NNP)
  thinks/VBZ
  (NP Mary/NN)
  saw/VBD
  (CLAUSE
    (NP the/DT cat/NN)
    (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))
Trees
print tree4[1]
(VP chased (NP the rabbit))
tree4[1].node
'VP'
tree4.leaves()
['Alice', 'chased', 'the', 'rabbit']
tree4[1][1][1]
'rabbit'
tree4.draw()
Trees – code_traverse.py
def traverse(t):
    try:
        t.node
    except AttributeError:
        print t,
    else:
        # Now we know that t.node is defined
        print '(', t.node,
        for child in t:
            traverse(child)
        print ')',

t = nltk.Tree('(S (NP Alice) (VP chased (NP the rabbit)))')
traverse(t)
NLTK 7.5 Named Entity Recognition
sent = nltk.corpus.treebank.tagged_sents()[22]
print nltk.ne_chunk(sent, binary=True)