l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/fall17/l17_rel_extraction.pdf• words...

41
Relation Extraction What is relation extraction? Many slides adapted from Dan Jurafsky

Upload: others

Post on 11-Oct-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

RelationExtraction

Whatisrelationextraction?

ManyslidesadaptedfromDanJurafsky

Page 2: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

Extractingrelationsfromtext• Companyreport: “InternationalBusinessMachinesCorporation(IBMor

thecompany)wasincorporatedintheStateofNewYorkonJune16,1911,astheComputing-Tabulating-RecordingCo.(C-T-R)…”

• ExtractedComplexRelation:Company-Founding

Company IBMLocation NewYorkDate June16,1911Original-Name Computing-Tabulating-RecordingCo.

• ButwewillfocusonthesimplertaskofextractingrelationtriplesFounding-year(IBM,1911)Founding-location(IBM,New York)

Page 3: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

WhyRelationExtraction?

• Createnewstructuredknowledgebases,usefulforanyapp• Augmentcurrentknowledgebases

• AddingwordstoWordNet thesaurus,factstoFreeBase orDBPedia

• Supportquestionanswering• Thegranddaughterofwhichactorstarredinthemovie“E.T.”?(acted-in ?x “E.T.”)(is-a ?y actor)(granddaughter-of ?x ?y)

• Butwhichrelationsshouldweextract?

3

Page 4: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

AutomaticContentExtraction(ACE)

ARTIFACT

GENERALAFFILIATION

ORGAFFILIATION

PART-WHOLE

PERSON-SOCIAL PHYSICAL

Located

Near

Business

Family Lasting Personal

Citizen-Resident-Ethnicity-Religion

Org-Location-Origin

Founder

EmploymentMembership

OwnershipStudent-Alum

Investor

User-Owner-Inventor-Manufacturer

GeographicalSubsidiary

Sports-Affiliation

17 relations from 2008 “Relation Extraction Task”

Page 5: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

AutomaticContentExtraction(ACE)

• Physical-LocatedPER-GPEHe was in Tennessee

• Part-Whole-SubsidiaryORG-ORGXYZ, the parent company of ABC

• Person-Social-FamilyPER-PERJohn’s wife Yoko

• Org-AFF-FounderPER-ORGSteve Jobs, co-founder of Apple…

•5

Page 6: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

UMLS:UnifiedMedicalLanguageSystem

• 134entitytypes,54relations

Injury disrupts PhysiologicalFunctionBodilyLocation location-of BiologicFunctionAnatomicalStructure part-of OrganismPharmacologicSubstancecauses PathologicalFunctionPharmacologicSubstancetreats PathologicFunction

Page 7: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

DatabasesofWikipedia Relations

7

RelationsextractedfromInfoboxStanfordstateCaliforniaStanfordmotto “DieLuft derFreiheit weht”…

WikipediaInfobox

Page 8: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

Howtobuildrelationextractors

1. Hand-writtenpatterns2. Supervisedmachinelearning3. Semi-supervisedandunsupervised• Bootstrapping(usingseeds)• Distantsupervision• Unsupervisedlearningfromtheweb

Page 9: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

RelationExtraction

Whatisrelationextraction?

Page 10: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

RelationExtraction

Usingpatternstoextractrelations

Page 11: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

ExtractingRicherRelationsUsingRules

• Intuition:relationsoftenholdbetweenspecificentities• located-in(ORGANIZATION,LOCATION)• founded (PERSON,ORGANIZATION)• cures(DRUG,DISEASE)

• StartwithNamedEntitytagstohelpextractrelation!

Page 12: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

NamedEntitiesaren’tquiteenough.Whichrelationsholdbetween2entities?

Drug Disease

Cure?Prevent?

Cause?

Page 13: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

Whatrelationsholdbetween2entities?

PERSON ORGANIZATION

Founder?

Investor?

Member?

Employee?

President?

Page 14: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

ExtractingRicherRelationsUsingRulesandNamedEntities

Whoholdswhatofficeinwhatorganization?PERSON, POSITION of ORG

• GeorgeMarshall,SecretaryofStateoftheUnitedStates

PERSON(named|appointed|chose|etc.) PERSON Prep?POSITION• TrumanappointedMarshallSecretaryofState

PERSON [be]?(named|appointed|etc.)Prep?ORG POSITION• GeorgeMarshallwasnamedUSSecretaryofState

Page 15: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

Hand-builtpatternsforrelations• Plus:• Humanpatternstendtobehigh-precision• Canbetailoredtospecificdomains

• Minus• Humanpatternsareoftenlow-recall• Alotofworktothinkofallpossiblepatterns!• Don’twanttohavetodothisforeveryrelation!• We’dlikebetteraccuracy

Page 16: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

RelationExtraction

Usingpatternstoextractrelations

Page 17: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

RelationExtraction

Supervisedrelationextraction

Page 18: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

Supervisedmachinelearningforrelations

• Chooseasetofrelationswe’dliketoextract• Chooseasetofrelevantnamedentities• Findandlabeldata

• Choosearepresentativecorpus• Labelthenamedentitiesinthecorpus• Hand-labeltherelationsbetweentheseentities• Breakintotraining,development,andtest

• Trainaclassifieronthetrainingset18

Page 19: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

Howtodoclassificationinsupervisedrelationextraction

1. Findallpairsofnamedentities(usuallyinsamesentence)

2. Decideif2entitiesarerelated3. Ifyes,classifytherelation• Whytheextrastep?

• Fasterclassificationtrainingbyeliminatingmostpairs• Canusedistinctfeature-setsappropriateforeachtask.

19

Page 20: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

AutomatedContentExtraction(ACE)

ARTIFACT

GENERALAFFILIATION

ORGAFFILIATION

PART-WHOLE

PERSON-SOCIAL PHYSICAL

Located

Near

Business

Family Lasting Personal

Citizen-Resident-Ethnicity-Religion

Org-Location-Origin

Founder

EmploymentMembership

OwnershipStudent-Alum

Investor

User-Owner-Inventor-Manufacturer

GeographicalSubsidiary

Sports-Affiliation

17 sub-relations of 6 relations from 2008 “Relation Extraction Task”

Page 21: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

RelationExtraction

Classifytherelationbetweentwoentitiesinasentence

AmericanAirlines,aunitofAMR,immediatelymatchedthemove,spokesmanTimWagnersaid.

SUBSIDIARY

FAMILYEMPLOYMENT

NIL

FOUNDER

CITIZEN

INVENTOR…

Page 22: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

WordFeaturesforRelationExtraction

• HeadwordsofM1andM2,andcombinationAirlinesWagnerAirlines-Wagner

• BagofwordsandbigramsinM1andM2{American,Airlines,Tim,Wagner,AmericanAirlines,TimWagner}

• WordsorbigramsinparticularpositionsleftandrightofM1/M2M2:-1spokesmanM2:+1said

• Bagofwordsorbigramsbetweenthetwoentities{a,AMR,of,immediately,matched,move,spokesman,the,unit}

AmericanAirlines,aunitofAMR,immediatelymatchedthemove,spokesmanTimWagnersaidMention1 Mention2

Page 23: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

NamedEntityTypeandMentionLevelFeaturesforRelationExtraction

• Named-entitytypes• M1:ORG• M2:PERSON

• Concatenationofthetwonamed-entitytypes• ORG-PERSON

• EntityLevelofM1andM2 (NAME,NOMINAL,PRONOUN)• M1:NAME [itor hewouldbePRONOUN]• M2:NAME [thecompanywouldbeNOMINAL]

AmericanAirlines,aunitofAMR,immediatelymatchedthemove,spokesmanTimWagnersaidMention1 Mention2

Page 24: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

ParseFeaturesforRelationExtraction

• BasesyntacticchunksequencefromonetotheotherNPNPPPVPNPNP

• ConstituentpaththroughthetreefromonetotheotherNPé NPé Sé Sê NP

• DependencypathAirlinesmatchedWagnersaid

AmericanAirlines,aunitofAMR,immediatelymatchedthemove,spokesmanTimWagnersaidMention1 Mention2

Page 25: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

Gazeteer andtriggerwordfeaturesforrelationextraction

• Triggerlistforfamily:kinshipterms• parent,wife,husband,grandparent,etc.[fromWordNet]

• Gazeteer:• Listsofusefulgeoorgeopoliticalwords• Countrynamelist• Othersub-entities

Page 26: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

AmericanAirlines,aunitofAMR,immediatelymatchedthemove,spokesmanTimWagnersaid.

Page 27: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

Classifiersforsupervisedmethods

• Nowyoucanuseanyclassifieryoulike• MaxEnt• NaïveBayes• SVM• ...

• Trainitonthetrainingset,tuneonthedev set,testonthetestset

Page 28: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

EvaluationofSupervisedRelationExtraction

• ComputeP/R/F1 foreachrelation

28

P = # of correctly extracted relationsTotal # of extracted relations

R = # of correctly extracted relationsTotal # of gold relations

F1 =2PRP + R

Page 29: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

Summary:SupervisedRelationExtraction

+ Cangethighaccuracieswithenoughhand-labeledtrainingdata,iftestsimilarenoughtotraining

- Labelingalargetrainingsetisexpensive

- Supervisedmodelsarebrittle,don’tgeneralizewelltodifferentgenres

Page 30: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

RelationExtraction

Supervisedrelationextraction

Page 31: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

RelationExtraction

Semi-supervisedandunsupervisedrelationextraction

Page 32: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

Seed-basedorbootstrappingapproachestorelationextraction

• Notrainingset?Maybeyouhave:• Afewseedtuples or• Afewhigh-precisionpatterns

• Canyouusethoseseedstodosomethinguseful?• Bootstrapping:usetheseedstodirectlylearntopopulatearelation

Page 33: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

RelationBootstrapping(Hearst1992)

• GatherasetofseedpairsthathaverelationR• Iterate:1. Findsentenceswiththesepairs2. Lookatthecontextbetweenoraroundthepair

andgeneralizethecontexttocreatepatterns3. Usethepatternsforgrep formorepairs

Page 34: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

Bootstrapping• <MarkTwain,Elmira>Seedtuple

• Grep (google)fortheenvironmentsoftheseedtuple“MarkTwainisburiedinElmira,NY.”

XisburiedinY“ThegraveofMarkTwainisinElmira”

ThegraveofXisinY“ElmiraisMarkTwain’sfinalrestingplace”

YisX’sfinalrestingplace.

• Usethosepatternstogrep fornewtuples• Iterate

Page 35: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

Dipre:Extract<author,book>pairs

• Startwith5seeds:

• FindInstances:TheComedyofErrors,by WilliamShakespeare,wasTheComedyofErrors,byWilliamShakespeare,isTheComedyofErrors,oneofWilliamShakespeare'searliestattemptsTheComedyofErrors,oneofWilliamShakespeare'smost

• Extractpatterns(groupbymiddle,takelongestcommonprefix/suffix)?x , by ?y , ?x , one of ?y ‘s

• Nowiterate,findingnewseedsthatmatchthepattern

Brin, Sergei. 1998. Extracting Patterns and Relations from the World Wide Web.

Author BookIsaacAsimov TheRobots ofDawnDavidBrin Startide RisingJamesGleick Chaos:MakingaNewScienceCharlesDickens GreatExpectationsWilliamShakespeare TheComedyofErrors

Page 36: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

DistantSupervision

• Combinebootstrappingwithsupervisedlearning• Insteadof5seeds,• Usealargedatabasetogethuge#ofseedexamples

• Createlotsoffeaturesfromalltheseexamples• Combineinasupervisedclassifier

Snow,Jurafsky,Ng.2005.Learningsyntacticpatternsforautomatichypernym discovery.NIPS17Fei WuandDanielS.Weld.2007.AutonomouslySemantifying Wikipeida.CIKM2007Mintz,Bills,Snow,Jurafsky.2009.Distantsupervisionforrelationextractionwithoutlabeleddata.ACL09

Page 37: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

Distantsupervisionparadigm

• Likesupervisedclassification:• Usesaclassifierwithlotsoffeatures• Supervisedbydetailedhand-createdknowledge• Doesn’trequireiterativelyexpandingpatterns

• Likeunsupervisedclassification:• Usesverylargeamountsofunlabeleddata• Notsensitivetogenreissuesintrainingcorpus

Page 38: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

Distantlysupervisedlearningofrelationextractionpatterns

Foreachrelation

Foreachtupleinbigdatabase

Findsentencesinlargecorpuswithbothentities

Extractfrequentfeatures(parse, words,etc)

Trainsupervisedclassifierusingthousandsoffeatures

4

1

2

3

5

PERwasborninLOCPER,born(XXXX),LOCPER’s birthplaceinLOC

<EdwinHubble,Marshfield><AlbertEinstein,Ulm>

Born-In

Hubble wasborninMarshfieldEinstein,born(1879),UlmHubble’sbirthplaceinMarshfield

P(born-in | f1,f2,f3,…,f70000)

Page 39: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

Unsupervisedrelationextraction

• OpenInformationExtraction:• extractrelationsfromthewebwithnotrainingdata,nolistofrelations

1. Useparseddatatotraina“trustworthytuple”classifier2. Single-passextractallrelationsbetweenNPs,keepiftrustworthy3. Assessorranksrelationsbasedontextredundancy

(FCI,specializesin,softwaredevelopment)

(Tesla,invented,coiltransformer)39

M.Banko,M.Cararella,S.Soderland,M.Broadhead,andO.Etzioni.2007.Openinformationextractionfromtheweb.IJCAI

Page 40: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

EvaluationofSemi-supervisedandUnsupervisedRelationExtraction

• Sinceitextractstotallynewrelationsfromtheweb• Thereisnogoldsetofcorrectinstancesofrelations!

• Can’tcomputeprecision(don’tknowwhichonesarecorrect)• Can’tcomputerecall(don’tknowwhichonesweremissed)

• Instead,wecanapproximateprecision(only)• Drawarandomsampleofrelationsfromoutput,checkprecisionmanually

• Canalsocomputeprecisionatdifferentlevelsofrecall.• Precisionfortop1000newrelations,top10,000newrelations,top100,000• Ineachcasetakingarandomsampleofthatset

• Butnowaytoevaluaterecall40

P̂ = # of correctly extracted relations in the sampleTotal # of extracted relations in the sample

Page 41: l17 rel extraction - ecology labfaculty.cse.tamu.edu/huangrh/Fall17/l17_rel_extraction.pdf• Words or bigrams in particular positions left and right of M1/M2 M2: -1 spokesman M2:

RelationExtraction

Semi-supervisedandunsupervisedrelationextraction