Relation Extraction
Prof. Sameer Singh
CS 295: STATISTICAL NLP
WINTER 2017
February 23, 2017
Based on slides from Dan Jurafsky, Chris Manning, and everyone else they copied from.

Outline
Introduction to Relation Extraction
Hand-written Patterns
Supervised Machine Learning
Semi and Unsupervised Learning
CS295:STATISTICALNLP(WINTER2017) 2
Knowledge Extraction
Text
John was born in Liverpool, to Julia and Alfred Lennon.
Literal Facts
birthplace(John Lennon, Liverpool)
childOf(John Lennon, Julia Lennon)
childOf(John Lennon, Alfred Lennon)

Relation Extraction
Company report: “International Business Machines Corporation (IBM or the company) was incorporated in the State of New York on June 16, 1911, as the Computing-Tabulating-Recording Co. (C-T-R)…”
Extracted Complex Relation: Company-Founding
Company: IBM
Location: New York
Date: June 16, 1911
Original-Name: Computing-Tabulating-Recording Co.
But we will focus on the simpler task of extracting relation triples
Founding-year(IBM, 1911)
Founding-location(IBM, New York)

Extracting Relation Triples
The Leland Stanford Junior University, commonly referred to as Stanford University or Stanford, is an American private research university located in Stanford, California … near Palo Alto, California … Leland Stanford … founded the university in 1891
Stanford EQ Leland Stanford Junior University
Stanford LOC-IN California
Stanford IS-A research university
Stanford LOC-NEAR Palo Alto
Stanford FOUNDED-IN 1891
Stanford FOUNDER Leland Stanford

News Domain
ROLE: relates a person to an organization or a geopolitical entity
◦ subtypes: member, owner, affiliate, client, citizen
PART: generalized containment
◦ subtypes: subsidiary, physical part-of, set membership
AT: permanent and transient locations
◦ subtypes: located, based-in, residence
SOCIAL: social relations among persons
◦ subtypes: parent, sibling, spouse, grandparent, associate

Automated Content Extraction
ARTIFACT
◦ User-Owner-Inventor-Manufacturer
GENERAL AFFILIATION
◦ Citizen-Resident-Ethnicity-Religion
◦ Org-Location-Origin
ORG AFFILIATION
◦ Founder
◦ Employment
◦ Membership
◦ Ownership
◦ Student-Alum
◦ Investor
◦ Sports-Affiliation
PART-WHOLE
◦ Geographical
◦ Subsidiary
PERSON-SOCIAL
◦ Business
◦ Family
◦ Lasting Personal
PHYSICAL
◦ Located
◦ Near

ACE Relations Examples
Physical-Located (PER-GPE): He was in Tennessee
Part-Whole-Subsidiary (ORG-ORG): XYZ, the parent company of ABC
Person-Social-Family (PER-PER): John’s wife Yoko
Org-AFF-Founder (PER-ORG): Steve Jobs, co-founder of Apple…

Geographical Relations

Medical Relations
UMLS Resource

Medical Relations
Doppler echocardiography can be used to diagnose left anterior descending artery stenosis in patients with type 2 diabetes
↓
Echocardiography, Doppler DIAGNOSES Acquired stenosis

Freebase Relations
Thousands of relations and millions of instances!
Manually created from multiple sources including Wikipedia InfoBoxes

Ontological Relations
IS-A (hypernym): subsumption between classes
◦ Giraffe IS-A ruminant IS-A ungulate IS-A mammal IS-A vertebrate IS-A animal…
Instance-of: relation between individual and class
◦ San Francisco instance-of city

Rules for IS-A Relation
Early intuition from Hearst (1992)
“Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use”
What does Gelidium mean?
How do you know?

Hearst’s Patterns for IS-A relations
Hearst (1992): Automatic Acquisition of Hyponyms
“Y such as X ((, X)* (, and|or) X)”
“such Y as X”
“X or other Y”
“X and other Y”
“Y including X”
“Y, especially X”

Hearst’s Patterns for IS-A relations
Hearst pattern     Example occurrences
X and other Y      ...temples, treasuries, and other important civic buildings.
X or other Y       Bruises, wounds, broken bones or other injuries...
Y such as X        The bow lute, such as the Bambara ndang...
Such Y as X        ...such authors as Herrick, Goldsmith, and Shakespeare.
Y including X      ...common-law countries, including Canada and England...
Y, especially X    European countries, especially France, England, and Spain...

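A minimal sketch of how the “Y such as X” pattern could be applied with a regular expression. It assumes, purely for illustration, that every X and Y is a single word; a real system would match noun-phrase chunks instead. The names `SUCH_AS` and `such_as_pairs` are hypothetical and do not come from Hearst (1992).

```python
import re

# Assumption: every X and Y is a single word; real systems match noun phrases.
# The optional "rest" group is only accepted when the list ends in and/or,
# so ", for laboratory..." after the first X is not mistaken for a list item.
SUCH_AS = re.compile(
    r"(?P<y>\w+),? such as (?P<first>\w+)"
    r"(?P<rest>(?:, \w+)*,? (?:and|or) \w+)?"
)

def such_as_pairs(sentence):
    """Return (hyponym, hypernym) pairs matched by 'Y such as X'."""
    pairs = []
    for m in SUCH_AS.finditer(sentence):
        xs = [m.group("first")]
        if m.group("rest"):
            # Split the tail ", X, X, and X" on commas and conjunctions.
            tail = re.split(r",? (?:and|or) |, ", m.group("rest"))
            xs += [x for x in tail if x]
        pairs += [(x, m.group("y")) for x in xs]
    return pairs

agar = ("Agar is a substance prepared from a mixture of red algae, "
        "such as Gelidium, for laboratory or industrial use")
print(such_as_pairs(agar))
print(such_as_pairs("countries such as France, England, and Spain"))
```

On the agar sentence this yields the single pair (Gelidium, algae), which is the inference the slide asks the reader to make by hand.
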
Extracting Richer Relations
Intuition: Relations often hold between specific types of entities
◦ located-in (ORGANIZATION, LOCATION)
◦ founded (PERSON, ORGANIZATION)
◦ cures (DRUG, DISEASE)
Start with Named Entity tags to extract relation!

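The type intuition can be sketched as a lookup from a pair of named-entity types to the relations whose signature allows that pair. The table below holds only the three signatures from the slide; `SIGNATURES` and `candidate_relations` are hypothetical names used for illustration.

```python
# Type signatures from the slide: relation -> (type of arg 1, type of arg 2).
SIGNATURES = {
    "located-in": ("ORGANIZATION", "LOCATION"),
    "founded": ("PERSON", "ORGANIZATION"),
    "cures": ("DRUG", "DISEASE"),
}

def candidate_relations(type1, type2):
    """Relations whose signature matches the two entity types, in order."""
    return sorted(r for r, sig in SIGNATURES.items() if sig == (type1, type2))

print(candidate_relations("PERSON", "ORGANIZATION"))
print(candidate_relations("PERSON", "LOCATION"))
```

The filter only narrows the candidates; it cannot choose among several relations that share one signature.
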
Entity Types aren’t enough
Which relations hold between 2 entities?
Drug    Disease
Cure?
Prevent?
Cause?

Which relations hold between two entities?
PERSON    ORGANIZATION
Founder?
Investor?
Member?
Employee?
President?

Extracting Richer Relations Using Rules and Named Entities
Who holds what office in what organization?
PERSON, POSITION of ORG
◦ George Marshall, Secretary of State of the United States
PERSON (named|appointed|chose|etc.) PERSON Prep? POSITION
◦ Truman appointed Marshall Secretary of State
PERSON [be]? (named|appointed|etc.) Prep? ORG POSITION
◦ George Marshall was named US Secretary of State

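A rough sketch of the first rule, “PERSON, POSITION of ORG”, applied to a sentence that has already been chunked and tagged. The input format (a list of `(text, tag)` pairs) and the POSITION tag are assumptions made for this example; an off-the-shelf named entity tagger would not normally emit POSITION.

```python
def holds_office(chunks):
    """Match 'PERSON , POSITION of ORG' over (text, tag) chunks.

    Assumption: tags are PERSON, POSITION, ORG, or O (anything else).
    Returns a list of (person, position, organization) triples.
    """
    texts = [text for text, _ in chunks]
    tags = [tag for _, tag in chunks]
    target = ["PERSON", "O", "POSITION", "O", "ORG"]
    facts = []
    for i in range(len(chunks) - 4):
        # Tags must line up, and the two O chunks must be "," and "of".
        if (tags[i:i + 5] == target
                and texts[i + 1] == ","
                and texts[i + 3] == "of"):
            facts.append((texts[i], texts[i + 2], texts[i + 4]))
    return facts

marshall = [
    ("George Marshall", "PERSON"),
    (",", "O"),
    ("Secretary of State", "POSITION"),
    ("of", "O"),
    ("the United States", "ORG"),
]
print(holds_office(marshall))
```

The other two rules would need further cases for the trigger verbs (named, appointed, ...) and the optional preposition.
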
Complex Surface Patterns
Combine tokens, dependency paths, and entity types to define rules.
Pattern: Argument 1 (Person) , DT CEO of Argument 2 (Organization)
Dependency arcs: appos, nmod, case, det
Bill Gates, the CEO of Microsoft, said…
Mr. Jobs, the brilliant and charming CEO of Apple Inc., said…
… announced by Steve Jobs, the CEO of Apple.
… announced by Bill Gates, the director and CEO of Microsoft.
… mused Bill, a former CEO of Microsoft.
and many other possible instantiations…

Rule-Based Extraction
Use a collection of rules as the system itself
Pattern: Argument 1 (Person) , DT CEO of Argument 2 (Organization)
Dependency arcs: appos, nmod, case, det
Implies: Argument 1 headOf Argument 2
Variations
Source:
• Manually specified
• Learned from Data
Multiple Rules:
• Attach priorities/precedence
• Attach probabilities (more later)

Hand-built patterns for relations
Pluses
◦ Human patterns tend to be high-precision
◦ Can be tailored to specific domains
◦ Easy to debug: why a prediction was made, how to fix?
Minuses
◦ Human patterns are often low-recall
◦ A lot of work to think of all possible patterns!
◦ Don’t want to have to do this for every relation!
◦ We’d like better accuracy (generalization)

Supervised Machine Learning
Choose a set of relations we’d like to extract
Choose a set of relevant named entities
Find and label data
◦ Choose a representative corpus
◦ Label the named entities in the corpus
◦ Hand-label the relations between these entities
◦ Break into training, development, and test
Train a classifier on the training set

Automated Content Extraction
The same relation types and subtypes as listed earlier: ARTIFACT, GENERAL AFFILIATION, ORG AFFILIATION, PART-WHOLE, PERSON-SOCIAL, PHYSICAL.
ACE 2008 “Relation Extraction Task”

Relation Extraction
Classify the relation between two entities in a sentence
American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.
SUBSIDIARY
FAMILY
EMPLOYMENT
NIL
FOUNDER
CITIZEN
INVENTOR
…

Word Features for Relation Extraction
Headwords of M1 and M2, and combination
◦ Airlines
◦ Wagner
◦ Airlines-Wagner
Bag of words and bigrams in M1 and M2
◦ {American, Airlines, Tim, Wagner, American Airlines, Tim Wagner}
Words or bigrams in particular positions left and right of M1/M2
◦ M2: -1 spokesman
◦ M2: +1 said
Bag of words or bigrams between the two entities
◦ {a, AMR, of, immediately, matched, move, spokesman, the, unit}
American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said
Mention 1: American Airlines; Mention 2: Tim Wagner

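The word features above can be sketched as a small function over a tokenized sentence. This is an illustration under simplifying assumptions: mentions are given as `(start, end)` token spans with `end` exclusive, Mention 1 precedes Mention 2, the headword is taken to be the last token of the mention, and only unigram features are shown. The function name `word_features` is hypothetical.

```python
def word_features(tokens, m1, m2):
    """Word-level features for a mention pair (m1 must precede m2)."""
    (s1, e1), (s2, e2) = m1, m2
    # Assumption: the headword is the last token of each mention.
    head1, head2 = tokens[e1 - 1], tokens[e2 - 1]
    return {
        "head_m1": head1,
        "head_m2": head2,
        "head_pair": f"{head1}-{head2}",
        "bow_mentions": set(tokens[s1:e1]) | set(tokens[s2:e2]),
        # Words strictly between the mentions, punctuation dropped.
        "bow_between": {t for t in tokens[e1:s2] if t.isalnum()},
        "m2_minus1": tokens[s2 - 1] if s2 > 0 else None,
        "m2_plus1": tokens[e2] if e2 < len(tokens) else None,
    }

tokens = ("American Airlines , a unit of AMR , immediately matched "
          "the move , spokesman Tim Wagner said").split()
feats = word_features(tokens, m1=(0, 2), m2=(14, 16))
print(feats["head_pair"], feats["m2_minus1"], feats["m2_plus1"])
```

For the example sentence this reproduces the headwords, the M2 context words, and the between-mention bag of words listed on the slide.
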
Named Entity Type and Mention Level Features
Named-entity types
◦ M1: ORG
◦ M2: PERSON
Concatenation of the two named-entity types
◦ ORG-PERSON
Entity Level of M1 and M2 (NAME, NOMINAL, PRONOUN)
◦ M1: NAME [it or he would be PRONOUN]
◦ M2: NAME [the company would be NOMINAL]
American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said
Mention 1: American Airlines; Mention 2: Tim Wagner

Dependency Parse Features for Relation Extraction
Base syntactic chunk sequence from one to the other
◦ NP NP PP VP NP NP
Constituent path through the tree from one to the other
◦ NP ↑ NP ↑ S ↑ S ↓ NP
Dependency path
◦ Airlines matched Wagner said
American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said
Mention 1: American Airlines; Mention 2: Tim Wagner

Gazetteer and Trigger word features for relation extraction
Trigger list for family: kinship terms
◦ parent, wife, husband, grandparent, etc. [from WordNet]
Gazetteer:
◦ Lists of useful geo or geopolitical words
◦ Country name list
◦ Other sub-entities

American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.

Supervised Extraction
Machine Learning: hopefully, generalizes the labels in the right way
Use all of NLP as features: words, POS, NER, dependencies, embeddings
However
Usually, a lot of labeled data is needed, which is expensive & time consuming.
Requires a lot of feature engineering!
Pipeline: John was born in Liverpool, to Julia and Alfred Lennon. → Feature Engineering (NER, POS, Dep Path, Text in b/w, embeddings, …) → Classifier → P(birthplace) = 0.75

Supervised Relation Extraction
Pluses
◦ Can get high accuracies with enough training data
◦ As long as the test data is similar enough to the training data
◦ Can utilize a number of NLP tasks
Minuses
◦ Labeling a large training set is expensive
◦ Supervised models are brittle, don’t generalize well to different genres

Seed-based or bootstrapping approaches to relation extraction
No training set? Maybe you have:
◦ A few seed tuples or
◦ A few high-precision patterns
Can you use those seeds to do something useful?
◦ Bootstrapping: use the seeds to directly learn a relation

Relation Bootstrapping
Gather a set of seed pairs that have the relation
1. Find sentences with these pairs
2. Look at the context between or around the pair and generalize the context to create patterns
3. Use the patterns to gather more pairs
4. Repeat

Bootstrapping Example
<Mark Twain, Elmira> Seed tuple of “died in”
Look for the environments of the seed tuple
“Mark Twain is buried in Elmira, NY.”
◦ X is buried in Y
“The grave of Mark Twain is in Elmira”
◦ The grave of X is in Y
“Elmira is Mark Twain’s final resting place”
◦ Y is X’s final resting place.
Use those patterns to find new tuples
Repeat

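The bootstrapping loop can be sketched in a few lines. This is a toy illustration, not a published algorithm: a pattern is simply a whole sentence with the two seed strings replaced by capture groups, matching is exact, and there is no confidence scoring, so a single bad pattern would pollute later rounds. The corpus beyond the Mark Twain sentences is made up for the example, and `bootstrap` is a hypothetical name.

```python
import re

def bootstrap(corpus, seeds, rounds=2):
    """Alternate between inducing patterns from pairs and pairs from patterns."""
    pairs, patterns = set(seeds), set()
    for _ in range(rounds):
        # Steps 1-2: find sentences containing a known pair and generalize
        # them by replacing X and Y with named capture groups.
        for x, y in pairs:
            for sent in corpus:
                if x in sent and y in sent:
                    pat = re.escape(sent)
                    pat = pat.replace(re.escape(x), "(?P<x>.+?)", 1)
                    pat = pat.replace(re.escape(y), "(?P<y>.+?)", 1)
                    patterns.add(pat)
        # Step 3: use the patterns to gather more pairs.
        for pat in patterns:
            for sent in corpus:
                m = re.fullmatch(pat, sent)
                if m:
                    pairs.add((m.group("x"), m.group("y")))
    return pairs, patterns

corpus = [
    "Mark Twain is buried in Elmira",
    "The grave of Mark Twain is in Elmira",
    "Emily Dickinson is buried in Amherst",
    "Amherst is Emily Dickinson's final resting place",
    "Hyde Park is Franklin Roosevelt's final resting place",
]
pairs, patterns = bootstrap(corpus, {("Mark Twain", "Elmira")})
print(sorted(pairs))
```

Round one learns “X is buried in Y” from the seed and picks up the Dickinson pair; round two learns “Y is X's final resting place” from that new pair and picks up the Roosevelt pair.
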
DIPRE: Extract <author, book> pairs
Start with 5 seeds:
Author               Book
Isaac Asimov         The Robots of Dawn
David Brin           Startide Rising
James Gleick         Chaos: Making a New Science
Charles Dickens      Great Expectations
William Shakespeare  The Comedy of Errors
Find Instances on the Web:
◦ The Comedy of Errors, by William Shakespeare, was
◦ The Comedy of Errors, by William Shakespeare, is
◦ The Comedy of Errors, one of William Shakespeare's earliest attempts
◦ The Comedy of Errors, one of William Shakespeare's most
Extract patterns (group by middle, take longest common prefix/suffix)
◦ ?x , by ?y ,
◦ ?x , one of ?y ’s
Now iterate, finding new seeds that match the pattern
Brin, Sergei. 1998. Extracting Patterns…

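The pattern-extraction step (group by middle, take longest common prefix/suffix) can be sketched as follows. Each occurrence is assumed to be pre-split into the text before the first argument, the text between the arguments, and the text after the second argument. This is a simplification for illustration: the sketch ignores argument order and anything else the full system tracks, and `dipre_patterns` is a hypothetical name.

```python
import os
from collections import defaultdict

def dipre_patterns(occurrences):
    """occurrences: (prefix, middle, suffix) strings around a seed pair.

    Groups occurrences by their middle string, then keeps the longest
    shared ending of the prefixes and the longest shared start of the
    suffixes within each group.
    """
    by_middle = defaultdict(list)
    for prefix, middle, suffix in occurrences:
        by_middle[middle].append((prefix, suffix))
    patterns = []
    for middle, contexts in by_middle.items():
        # Reverse the prefixes so commonprefix finds their common ending.
        reversed_prefixes = [p[::-1] for p, _ in contexts]
        left = os.path.commonprefix(reversed_prefixes)[::-1]
        right = os.path.commonprefix([s for _, s in contexts])
        patterns.append((left, middle, right))
    return patterns

occurrences = [
    ("", ", by ", ", was"),
    ("", ", by ", ", is"),
    ("", ", one of ", "'s earliest attempts"),
    ("", ", one of ", "'s most"),
]
print(dipre_patterns(occurrences))
```

`os.path.commonprefix` compares strings character by character, which is the behavior wanted here even though the strings are not paths. The two resulting patterns correspond to “?x , by ?y ,” and “?x , one of ?y 's”.
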
Snowball
Similar iterative algorithm
Organization    Location of Headquarters
Microsoft       Redmond
Exxon           Irving
IBM             Armonk
Group instances w/ similar prefix, middle, suffix, extract patterns
◦ But require that X and Y be named entities
◦ And compute a confidence for each pattern
.69   ORGANIZATION {’s, in, headquarters} LOCATION
.75   LOCATION {in, based} ORGANIZATION
E. Agichtein and L. Gravano, ICDL (2000)

Distant Supervision
Combine bootstrapping with supervised learning
◦ Instead of 5 (or just a few) seeds,
◦ Use a large database to get huge # of seed examples
◦ Create lots of features from all these examples
◦ Combine in a supervised classifier
Snow, Jurafsky, Ng (2005), Wu & Weld (2007), Mintz, Bills, Snow, Jurafsky (2009)

Distantly Supervised learning of relation extraction patterns
1. For each relation
◦ Born-In
2. For each tuple in big database
◦ <Edwin Hubble, Marshfield>
◦ <Albert Einstein, Ulm>
3. Find sentences in large corpus with both entities
◦ Hubble was born in Marshfield
◦ Einstein, born (1879), Ulm
◦ Hubble’s birthplace in Marshfield
4. Extract frequent features (parse, words, etc)
◦ PER was born in LOC
◦ PER, born (XXXX), LOC
◦ PER’s birthplace in LOC
5. Train supervised classifier using these patterns
◦ P(born-in | f1, f2, f3, …, f70000)

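Steps 1 to 4 can be sketched as a loop that turns a database of tuples plus an unlabeled corpus into labeled training examples. This is a toy version under stated assumptions: entities are matched by exact surname string, the PER and LOC placeholders are hard-coded for this one relation, four-digit years are masked as XXXX, and each whole sentence becomes a single feature. The function name `distant_examples` is hypothetical, and step 5 (training the classifier) is left out.

```python
import re

def distant_examples(kb, corpus):
    """kb: {relation: [(e1, e2), ...]}; corpus: list of sentences.

    Returns (feature, relation) pairs, one per sentence that mentions
    both entities of some database tuple.
    """
    examples = []
    for relation, tuples in kb.items():           # step 1
        for e1, e2 in tuples:                     # step 2
            for sent in corpus:                   # step 3
                if e1 in sent and e2 in sent:
                    # Step 4: abstract the entities and years away.
                    feat = sent.replace(e1, "PER").replace(e2, "LOC")
                    feat = re.sub(r"\d{4}", "XXXX", feat)
                    examples.append((feat, relation))
    return examples

kb = {"Born-In": [("Hubble", "Marshfield"), ("Einstein", "Ulm")]}
corpus = [
    "Hubble was born in Marshfield",
    "Einstein, born (1879), Ulm",
    "Hubble's birthplace in Marshfield",
]
for feat, rel in distant_examples(kb, corpus):
    print(rel, "<-", feat)
```

No sentence was labeled by hand: the labels come entirely from the database, which is why they can be noisy when a sentence mentions both entities without expressing the relation.
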
Distant Supervision Paradigm
Like supervised classification:
◦ Uses a classifier with lots of features
◦ Supervised by detailed hand-created knowledge
◦ Doesn’t require iteratively expanding patterns
Like unsupervised classification:
◦ Uses very large amounts of unlabeled data
◦ Not sensitive to genre issues in training corpus

Unsupervised Relation Extraction
Open Information Extraction:
◦ extract relations from the web with no training data, no list of relations
1. Use parsed data to train a “trustworthy tuple” classifier
2. Single-pass extract all relations between NPs, keep if trustworthy
3. Assessor ranks relations based on text redundancy
(FCI, specializes in, software development)
(Tesla, invented, coil transformer)
Banko, Cafarella, Soderland, Broadhead, Etzioni. 2007

Evaluation of Semi-supervised and Unsupervised Relation Extraction
Since it extracts totally new relations from the web
◦ There is no gold set of correct instances of relations!
◦ Can’t compute precision (don’t know which ones are correct)
◦ Can’t compute recall (don’t know which ones were missed)
Instead, we can approximate precision (only)
◦ Draw a random sample of relations from output, check precision manually
P̂ = (# of correctly extracted relations in the sample) / (total # of extracted relations in the sample)
Can also compute precision at different levels of recall.
◦ Precision for top 1000 new relations, top 10,000 new relations, top 100,000
◦ In each case taking a random sample of that set
But no way to evaluate recall

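The sampled precision estimate can be sketched as below. The `is_correct` callback stands in for the manual check by a human annotator, and the fixed `seed` is only there to make the sample reproducible; both, along with the function name `estimate_precision`, are assumptions of this example.

```python
import random

def estimate_precision(extractions, is_correct, sample_size, seed=0):
    """Estimate precision from a random sample of the extracted relations.

    is_correct is a stand-in for manual judgment of each sampled item.
    """
    rng = random.Random(seed)
    k = min(sample_size, len(extractions))
    sample = rng.sample(extractions, k)
    correct = sum(1 for e in sample if is_correct(e))
    return correct / len(sample)

# Toy data: 10 ranked extractions, of which the first 7 are correct.
ranked = list(range(10))
judge = lambda e: e < 7
print(estimate_precision(ranked, judge, sample_size=10))
# Precision at a cutoff: sample only from the top-k of the ranked output.
print(estimate_precision(ranked[:5], judge, sample_size=5))
```

Passing a prefix of the ranked output, as in the last line, gives the precision for the top-k relations described on the slide. Recall stays out of reach because the set of relations that should have been extracted is unknown.
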
Upcoming…
Homework
• Homework 3 is due on February 27
• Write-up and data have been released.
Project
• Status report due in 1.5 weeks: March 2, 2017
• Instructions coming soon
• Only 5 pages
Summaries
• Paper summaries: February 28, March 14
• Only 1 page each