relation extraction - github pages aritter.github.io/courses/5525_slides/relation_extraction.pdf ·...

Post on 14-Aug-2019


Relation Extraction

Many slides from Dan Jurafsky

Extracting relations from text

• Company report: “International Business Machines Corporation (IBM or the company) was incorporated in the State of New York on June 16, 1911, as the Computing-Tabulating-Recording Co. (C-T-R)…”
• Extracted Complex Relation:

Company-Founding
  Company: IBM
  Location: New York
  Date: June 16, 1911
  Original-Name: Computing-Tabulating-Recording Co.

• But we will focus on the simpler task of extracting relation triples
  Founding-year(IBM, 1911)
  Founding-location(IBM, New York)

Extracting Relation Triples from Text

The Leland Stanford Junior University, commonly referred to as Stanford University or Stanford, is an American private research university located in Stanford, California … near Palo Alto, California … Leland Stanford … founded the university in 1891

Stanford EQ Leland Stanford Junior University
Stanford LOC-IN California
Stanford IS-A research university
Stanford LOC-NEAR Palo Alto
Stanford FOUNDED-IN 1891
Stanford FOUNDER Leland Stanford

Why Relation Extraction?

• Create new structured knowledge bases, useful for any app
• Augment current knowledge bases
  • Adding words to WordNet thesaurus, facts to FreeBase or DBPedia
• Support question answering
  • The granddaughter of which actor starred in the movie “E.T.”?
    (acted-in ?x “E.T.”) (is-a ?y actor) (granddaughter-of ?x ?y)
• But which relations should we extract?
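The question-answering query on this slide can be read as a join over relation triples. The sketch below is only an illustration: the `Triple` type, the `answer` function, and the four-fact toy knowledge base are assumptions made for this example, not part of any real knowledge base API.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subj: str
    rel: str
    obj: str


# Toy knowledge base, hand-written for this example.
KB = {
    Triple("Drew Barrymore", "acted-in", "E.T."),
    Triple("Henry Thomas", "acted-in", "E.T."),
    Triple("Drew Barrymore", "granddaughter-of", "John Barrymore"),
    Triple("John Barrymore", "is-a", "actor"),
}


def answer(kb):
    """Solve (acted-in ?x "E.T.") (is-a ?y actor) (granddaughter-of ?x ?y) for ?y."""
    cast = {t.subj for t in kb if t.rel == "acted-in" and t.obj == "E.T."}
    actors = {t.subj for t in kb if t.rel == "is-a" and t.obj == "actor"}
    return sorted(
        t.obj
        for t in kb
        if t.rel == "granddaughter-of" and t.subj in cast and t.obj in actors
    )


print(answer(KB))
```

Each query clause filters the triples on one relation; the shared variables ?x and ?y are what tie the three filters together.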


Automated Content Extraction (ACE)

17 relations from 2008 “Relation Extraction Task”

• ARTIFACT: User-Owner-Inventor-Manufacturer
• GENERAL AFFILIATION: Citizen-Resident-Ethnicity-Religion, Org-Location-Origin
• ORG AFFILIATION: Founder, Employment, Membership, Ownership, Student-Alum, Investor, Sports-Affiliation
• PART-WHOLE: Geographical, Subsidiary
• PERSON-SOCIAL: Business, Family, Lasting Personal
• PHYSICAL: Located, Near

Automated Content Extraction (ACE)

• Physical-Located (PER-GPE): He was in Tennessee
• Part-Whole-Subsidiary (ORG-ORG): XYZ, the parent company of ABC
• Person-Social-Family (PER-PER): John’s wife Yoko
• Org-AFF-Founder (PER-ORG): Steve Jobs, co-founder of Apple…

UMLS: Unified Medical Language System

• 134 entity types, 54 relations

Injury                    disrupts      Physiological Function
Bodily Location           location-of   Biologic Function
Anatomical Structure      part-of       Organism
Pharmacologic Substance   causes        Pathological Function
Pharmacologic Substance   treats        Pathologic Function

Extracting UMLS relations from a sentence

Doppler echocardiography can be used to diagnose left anterior descending artery stenosis in patients with type 2 diabetes

↓

Echocardiography, Doppler DIAGNOSES Acquired stenosis

Databases of Wikipedia Relations

Wikipedia Infobox

Relations extracted from Infobox:
Stanford state California
Stanford motto “Die Luft der Freiheit weht”
…

Relation databases that draw from Wikipedia

• Resource Description Framework (RDF) triples
  subject predicate object
  Golden Gate Park location San Francisco
  dbpedia:Golden_Gate_Park dbpedia-owl:location dbpedia:San_Francisco
• DBPedia: 1 billion RDF triples, 385 million from English Wikipedia
• Frequent Freebase relations:
  people/person/nationality, location/location/contains
  people/person/profession, people/person/place-of-birth
  biology/organism_higher_classification, film/film/genre

Ontological relations

Examples from the WordNet Thesaurus

• IS-A (hypernym): subsumption between classes
  • Giraffe IS-A ruminant IS-A ungulate IS-A mammal IS-A vertebrate IS-A animal…
• Instance-of: relation between individual and class
  • San Francisco instance-of city

How to build relation extractors

1. Hand-written patterns
2. Supervised machine learning
3. Semi-supervised and unsupervised
  • Bootstrapping (using seeds)
  • Distant supervision
  • Unsupervised learning from the web

Relation Extraction

What is relation extraction?

Relation Extraction

Using patterns to extract relations

Rules for extracting IS-A relation

Early intuition from Hearst (1992)
• “Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use”
• What does Gelidium mean?
• How do you know?

Hearst’s Patterns for extracting IS-A relations

(Hearst, 1992): Automatic Acquisition of Hyponyms

“Y such as X ((, X)* (, and|or) X)”
“such Y as X”
“X or other Y”
“X and other Y”
“Y including X”
“Y, especially X”

Hearst pattern     Example occurrences
X and other Y      ...temples, treasuries, and other important civic buildings.
X or other Y       Bruises, wounds, broken bones or other injuries...
Y such as X        The bow lute, such as the Bambara ndang...
Such Y as X        ...such authors as Herrick, Goldsmith, and Shakespeare.
Y including X      ...common-law countries, including Canada and England...
Y, especially X    European countries, especially France, England, and Spain...
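The Hearst patterns can be approximated with regular expressions. The following is a rough sketch, not Hearst's implementation: it assumes every noun phrase is a single word (a real system would run an NP chunker first), and the names `WORD`, `LIST`, `COMMA_LIST`, `PATTERNS`, and `hearst_pairs` are invented for this example.

```python
import re

# Assumption: a noun phrase is approximated by one word. "and"/"or" are
# excluded so that list conjunctions are not mistaken for list items.
WORD = r"\b(?!(?:and|or)\b)[A-Za-z][A-Za-z-]*"
# Hyponym list in the shape of Hearst's "X ((, X)* (, and|or) X)".
LIST = rf"{WORD}(?:(?:, {WORD})*,? (?:and|or) {WORD})?"
# Plain comma list, used when the conjunction belongs to the cue phrase.
COMMA_LIST = rf"{WORD}(?:, {WORD})*"

PATTERNS = [  # (regex, True if group 1 is the hypernym Y)
    (re.compile(rf"({WORD}),? such as ({LIST})"), True),
    (re.compile(rf"such ({WORD}) as ({LIST})"), True),
    (re.compile(rf"({WORD}),? including ({LIST})"), True),
    (re.compile(rf"({WORD}),? especially ({LIST})"), True),
    (re.compile(rf"({COMMA_LIST}),? (?:and|or) other ({WORD})"), False),
]
SPLIT = re.compile(r",? (?:and|or) |, ")


def hearst_pairs(text):
    """Return (hyponym, hypernym) pairs found by the patterns above."""
    pairs = []
    for regex, hypernym_first in PATTERNS:
        for m in regex.finditer(text):
            if hypernym_first:
                hyper, hypos = m.group(1), m.group(2)
            else:
                hyper, hypos = m.group(2), m.group(1)
            pairs.extend((hypo, hyper) for hypo in SPLIT.split(hypos))
    return pairs


print(hearst_pairs("such authors as Herrick, Goldsmith, and Shakespeare"))
```

Because of the single-word assumption, the agar sentence yields the head noun "algae" rather than the full phrase "red algae", and multi-word items such as "broken bones" are only partially captured.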

Extracting Richer Relations Using Rules

• Intuition: relations often hold between specific entities
  • located-in (ORGANIZATION, LOCATION)
  • founded (PERSON, ORGANIZATION)
  • cures (DRUG, DISEASE)
• Start with Named Entity tags to help extract relation!

Named Entities aren’t quite enough. Which relations hold between 2 entities?

DRUG, DISEASE: Cure? Prevent? Cause?

What relations hold between 2 entities?

PERSON, ORGANIZATION: Founder? Investor? Member? Employee? President?

Extracting Richer Relations Using Rules and Named Entities

Who holds what office in what organization?

PERSON, POSITION of ORG
• George Marshall, Secretary of State of the United States

PERSON (named|appointed|chose|etc.) PERSON Prep? POSITION
• Truman appointed Marshall Secretary of State

PERSON [be]? (named|appointed|etc.) Prep? ORG POSITION
• George Marshall was named US Secretary of State
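A rule such as "PERSON, POSITION of ORG" can be matched over the output of a named-entity tagger. The sketch below assumes the tagger has already produced a list of (text, tag) chunks, with "O" marking untagged tokens; that chunk format and the `holds_office` function are assumptions made for illustration.

```python
def holds_office(chunks):
    """Match the rule `PERSON , POSITION of ORG` over NE-tagged chunks.

    `chunks` is a list of (text, tag) pairs. Returns a list of
    (person, position, organization) triples.
    """
    matches = []
    for i in range(len(chunks) - 4):
        (person, t1), (comma, _), (position, t3), (of, _), (org, t5) = chunks[i:i + 5]
        if (t1, comma, t3, of.lower(), t5) == ("PERSON", ",", "POSITION", "of", "ORG"):
            matches.append((person, position, org))
    return matches


# Hand-tagged version of the slide's example (tags are assumed, not tagger output).
tagged = [
    ("George Marshall", "PERSON"),
    (",", "O"),
    ("Secretary of State", "POSITION"),
    ("of", "O"),
    ("the United States", "ORG"),
]
print(holds_office(tagged))
```

The other two rules on the slide could be written the same way, with the trigger verbs (named, appointed, chose) checked as literal tokens.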

Hand-built patterns for relations

• Plus:
  • Human patterns tend to be high-precision
  • Can be tailored to specific domains
• Minus:
  • Human patterns are often low-recall
  • A lot of work to think of all possible patterns!
  • Don’t want to have to do this for every relation!
  • We’d like better accuracy

Relation Extraction

Supervised relation extraction

Supervised machine learning for relations

• Choose a set of relations we’d like to extract
• Choose a set of relevant named entities
• Find and label data
  • Choose a representative corpus
  • Label the named entities in the corpus
  • Hand-label the relations between these entities
  • Break into training, development, and test
• Train a classifier on the training set

How to do classification in supervised relation extraction

1. Find all pairs of named entities (usually in same sentence)
2. Decide if 2 entities are related
3. If yes, classify the relation
• Why the extra step?
  • Faster classification training by eliminating most pairs
  • Can use distinct feature-sets appropriate for each task.
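The three steps above can be sketched as a small pipeline. This is only a skeleton: `is_related` and `classify` stand in for the two trained classifiers, and the lookup table used to fake them below contains hand-picked toy labels, not gold ACE annotations.

```python
from itertools import combinations


def extract_relations(entities, is_related, classify):
    """Two-stage extraction over the entity mentions of one sentence."""
    results = []
    for m1, m2 in combinations(entities, 2):  # step 1: all entity pairs
        if not is_related(m1, m2):            # step 2: cheap binary filter
            continue
        results.append((m1, classify(m1, m2), m2))  # step 3: label the relation
    return results


# Stand-ins for trained models (assumption: a lookup table fakes both stages).
TOY_LABELS = {
    ("American Airlines", "AMR"): "SUBSIDIARY",
    ("American Airlines", "Tim Wagner"): "EMPLOYMENT",
}
mentions = ["American Airlines", "AMR", "Tim Wagner"]
print(extract_relations(
    mentions,
    is_related=lambda a, b: (a, b) in TOY_LABELS,
    classify=lambda a, b: TOY_LABELS[(a, b)],
))
```

Keeping the two stages as separate functions mirrors the slide's point that each stage can have its own feature set.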


Automated Content Extraction (ACE)

17 sub-relations of 6 relations from 2008 “Relation Extraction Task”

Relation Extraction

Classify the relation between two entities in a sentence

American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.

Candidate labels: SUBSIDIARY, FAMILY, EMPLOYMENT, NIL, FOUNDER, CITIZEN, INVENTOR, …

Word Features for Relation Extraction

American Airlines [Mention 1], a unit of AMR, immediately matched the move, spokesman Tim Wagner [Mention 2] said

• Headwords of M1 and M2, and combination
  Airlines, Wagner, Airlines-Wagner
• Bag of words and bigrams in M1 and M2
  {American, Airlines, Tim, Wagner, American Airlines, Tim Wagner}
• Words or bigrams in particular positions left and right of M1/M2
  M2: -1 spokesman
  M2: +1 said
• Bag of words or bigrams between the two entities
  {a, AMR, of, immediately, matched, move, spokesman, the, unit}
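The word features listed on this slide can be computed from a tokenized sentence and the two mention spans. The sketch below makes two simplifying assumptions: the head word of a mention is taken to be its last token, and punctuation tokens are dropped from the between-mentions bag. The function name and feature keys are invented for the example.

```python
def word_features(tokens, m1, m2):
    """Word-level features for a mention pair.

    `m1` and `m2` are (start, end) token spans, end exclusive, with m1 before m2.
    """
    (s1, e1), (s2, e2) = m1, m2
    w1, w2 = tokens[s1:e1], tokens[s2:e2]
    feats = {
        # Assumption: the head word is the last token of the mention.
        "head_m1": w1[-1],
        "head_m2": w2[-1],
        "head_pair": f"{w1[-1]}-{w2[-1]}",
        "mention_words": set(w1) | set(w2),
        "mention_bigrams": {" ".join(b) for w in (w1, w2) for b in zip(w, w[1:])},
        # Punctuation tokens are skipped in the between-mentions bag.
        "between_words": {t for t in tokens[e1:s2] if t.isalnum()},
    }
    if s2 > 0:
        feats["M2:-1"] = tokens[s2 - 1]
    if e2 < len(tokens):
        feats["M2:+1"] = tokens[e2]
    return feats


tokens = ["American", "Airlines", ",", "a", "unit", "of", "AMR", ",", "immediately",
          "matched", "the", "move", ",", "spokesman", "Tim", "Wagner", "said"]
feats = word_features(tokens, m1=(0, 2), m2=(14, 16))
print(feats["head_pair"], feats["M2:-1"], feats["M2:+1"])
```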

Named Entity Type and Mention Level Features for Relation Extraction

American Airlines [Mention 1], a unit of AMR, immediately matched the move, spokesman Tim Wagner [Mention 2] said

• Named-entity types
  • M1: ORG
  • M2: PERSON
• Concatenation of the two named-entity types
  • ORG-PERSON
• Entity Level of M1 and M2 (NAME, NOMINAL, PRONOUN)
  • M1: NAME [it or he would be PRONOUN]
  • M2: NAME [the company would be NOMINAL]
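The entity-type and mention-level features can be sketched in a few lines. The mention-level rule here is a crude capitalization heuristic chosen for illustration (real ACE mention levels come from annotation or a trained model), and the pronoun list and function names are assumptions.

```python
# Assumption: a small closed list is enough to spot pronouns in this sketch.
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them"}


def mention_level(text):
    """Heuristic guess at NAME / NOMINAL / PRONOUN for a mention string."""
    if text.lower() in PRONOUNS:
        return "PRONOUN"
    if all(word[0].isupper() for word in text.split()):
        return "NAME"
    return "NOMINAL"


def entity_features(m1_text, m1_type, m2_text, m2_type):
    """Named-entity type and mention-level features for a mention pair."""
    return {
        "type_m1": m1_type,
        "type_m2": m2_type,
        "type_pair": f"{m1_type}-{m2_type}",
        "level_m1": mention_level(m1_text),
        "level_m2": mention_level(m2_text),
    }


print(entity_features("American Airlines", "ORG", "Tim Wagner", "PERSON"))
```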

Parse Features for Relation Extraction

American Airlines [Mention 1], a unit of AMR, immediately matched the move, spokesman Tim Wagner [Mention 2] said

• Base syntactic chunk sequence from one to the other
  NP NP PP VP NP NP
• Constituent path through the tree from one to the other
  NP ↑ NP ↑ S ↑ S ↓ NP
• Dependency path
  Airlines <- matched -> Wagner -> said

Gazetteer and trigger word features for relation extraction

American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.

• Trigger list for family: kinship terms
  • parent, wife, husband, grandparent, etc. [from WordNet]
• Gazetteer:
  • Lists of useful geo or geopolitical words
    • Country name list
    • Other sub-entities

Classifiers for supervised methods

• Now you can use any classifier you like
  • MaxEnt
  • Naïve Bayes
  • SVM
  • ...
• Train it on the training set, tune on the dev set, test on the test set

Evaluation of Supervised Relation Extraction

• Compute P/R/F1 for each relation

$$P = \frac{\text{\# of correctly extracted relations}}{\text{Total \# of extracted relations}}$$

$$R = \frac{\text{\# of correctly extracted relations}}{\text{Total \# of gold relations}}$$

$$F_1 = \frac{2PR}{P + R}$$
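The precision, recall, and F1 definitions on this slide translate directly into code. In this sketch a relation instance is any hashable value (for example a triple), and the zero-division conventions, returning 0.0 when a denominator is empty, are an assumption rather than something the slides specify.

```python
def prf1(predicted, gold):
    """Precision, recall, and F1 for one relation type."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Toy data: 4 extracted instances, 3 of them among the 6 gold instances.
gold = {("a", 1), ("b", 2), ("c", 3), ("d", 4), ("e", 5), ("f", 6)}
predicted = {("a", 1), ("b", 2), ("c", 3), ("z", 9)}
p, r, f1 = prf1(predicted, gold)
print(p, r, round(f1, 3))
```

Here P = 3/4, R = 3/6, and F1 = 2(0.75)(0.5)/(1.25) = 0.6.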

Summary: Supervised Relation Extraction

+ Can get high accuracies with enough hand-labeled training data, if test similar enough to training
- Labeling a large training set is expensive
- Supervised models are brittle, don’t generalize well to different genres

Relation Extraction

Semi-supervised and unsupervised relation extraction

Seed-based or bootstrapping approaches to relation extraction

• No training set? Maybe you have:
  • A few seed tuples or
  • A few high-precision patterns
• Can you use those seeds to do something useful?
  • Bootstrapping: use the seeds to directly learn to populate a relation

Relation Bootstrapping (Hearst 1992)

• Gather a set of seed pairs that have relation R
• Iterate:
  1. Find sentences with these pairs
  2. Look at the context between or around the pair and generalize the context to create patterns
  3. Use the patterns to grep for more pairs

Bootstrapping

• <Mark Twain, Elmira> Seed tuple
• Grep (google) for the environments of the seed tuple
  “Mark Twain is buried in Elmira, NY.”
    X is buried in Y
  “The grave of Mark Twain is in Elmira”
    The grave of X is in Y
  “Elmira is Mark Twain’s final resting place”
    Y is X’s final resting place.
• Use those patterns to grep for new tuples
• Iterate
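The bootstrapping loop can be sketched on a tiny in-memory corpus. This is a deliberately naive version: a "pattern" is a whole sentence with the seed pair replaced by placeholders, there is no pattern scoring, and the four-sentence corpus is a toy written for this example. All function names are assumptions.

```python
import re


def induce_patterns(tuples, corpus):
    """Turn every sentence containing a known pair into an X/Y template."""
    patterns = set()
    for x, y in tuples:
        for sent in corpus:
            if x in sent and y in sent:
                patterns.add(sent.replace(x, "{X}").replace(y, "{Y}"))
    return patterns


def to_regex(pattern):
    """Escape the literal text and turn the placeholders into capture groups."""
    parts = re.split(r"(\{X\}|\{Y\})", pattern)
    return "".join(
        f"(?P<{p[1]}>.+?)" if p in ("{X}", "{Y}") else re.escape(p) for p in parts
    )


def apply_patterns(patterns, corpus):
    """Grep the corpus with every pattern and collect the matched pairs."""
    found = set()
    for pattern in patterns:
        regex = re.compile(to_regex(pattern))
        for sent in corpus:
            m = regex.fullmatch(sent)
            if m:
                found.add((m.group("X"), m.group("Y")))
    return found


def bootstrap(seeds, corpus, rounds=2):
    tuples = set(seeds)
    for _ in range(rounds):
        tuples |= apply_patterns(induce_patterns(tuples, corpus), corpus)
    return tuples


corpus = [  # toy corpus for illustration
    "Mark Twain is buried in Elmira",
    "Emily Dickinson is buried in Amherst",
    "The grave of Emily Dickinson is in Amherst",
    "The grave of Karl Marx is in London",
]
print(sorted(bootstrap({("Mark Twain", "Elmira")}, corpus)))
```

The first round learns "X is buried in Y" and finds the Dickinson pair; the second round uses that pair to learn "The grave of X is in Y" and finds the Marx pair. Without any confidence scoring, one bad pattern would spread errors in the same way.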

Dipre: Extract <author, book> pairs

• Start with 5 seeds:

  Author                Book
  Isaac Asimov          The Robots of Dawn
  David Brin            Startide Rising
  James Gleick          Chaos: Making a New Science
  Charles Dickens       Great Expectations
  William Shakespeare   The Comedy of Errors

• Find Instances:
  The Comedy of Errors, by William Shakespeare, was
  The Comedy of Errors, by William Shakespeare, is
  The Comedy of Errors, one of William Shakespeare's earliest attempts
  The Comedy of Errors, one of William Shakespeare's most
• Extract patterns (group by middle, take longest common prefix/suffix)
  ?x , by ?y ,
  ?x , one of ?y ’s
• Now iterate, finding new seeds that match the pattern

Brin, Sergei. 1998. Extracting Patterns and Relations from the World Wide Web.
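The pattern-extraction step ("group by middle, take longest common prefix/suffix") can be sketched as follows. The sketch assumes each occurrence has already been split into the text before ?x, the text between ?x and ?y, and the text after ?y; it reads the rule as keeping the longest shared ending of the left contexts and the longest shared beginning of the right contexts. The function name and tuple layout are assumptions, and the full DIPRE algorithm includes details (such as URL prefixes and pattern specificity checks) that are not modeled here.

```python
import os
from collections import defaultdict


def dipre_patterns(occurrences):
    """Group (prefix, middle, suffix) contexts by middle and generalize them."""
    by_middle = defaultdict(list)
    for prefix, middle, suffix in occurrences:
        by_middle[middle].append((prefix, suffix))

    patterns = []
    for middle, contexts in by_middle.items():
        prefixes = [p for p, _ in contexts]
        suffixes = [s for _, s in contexts]
        # Longest common ending of the left contexts (reverse, take prefix, reverse).
        left = os.path.commonprefix([p[::-1] for p in prefixes])[::-1]
        # Longest common beginning of the right contexts.
        right = os.path.commonprefix(suffixes)
        patterns.append((left, middle, right))
    return patterns


# The four instances from the slide, with ?x = book and ?y = author removed.
occurrences = [
    ("", ", by ", ", was"),
    ("", ", by ", ", is"),
    ("", ", one of ", "'s earliest attempts"),
    ("", ", one of ", "'s most"),
]
for left, middle, right in dipre_patterns(occurrences):
    print(repr(f"{left}?x{middle}?y{right}"))
```

On the slide's instances this yields the two patterns shown there, up to whitespace: "?x, by ?y, " and "?x, one of ?y's ".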

Snowball

• Similar iterative algorithm

  Organization   Location of Headquarters
  Microsoft      Redmond
  Exxon          Irving
  IBM            Armonk

• Group instances w/ similar prefix, middle, suffix, extract patterns
  • But require that X and Y be named entities
  • And compute a confidence for each pattern

  ORGANIZATION {’s, in, headquarters} LOCATION   .69
  LOCATION {in, based} ORGANIZATION              .75

E. Agichtein and L. Gravano 2000. Snowball: Extracting Relations from Large Plain-Text Collections. ICDL

Distant Supervision

• Combine bootstrapping with supervised learning
  • Instead of 5 seeds,
  • Use a large database to get huge # of seed examples
• Create lots of features from all these examples
• Combine in a supervised classifier

Snow, Jurafsky, Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. NIPS 17
Fei Wu and Daniel S. Weld. 2007. Autonomously Semantifying Wikipedia. CIKM 2007
Mintz, Bills, Snow, Jurafsky. 2009. Distant supervision for relation extraction without labeled data. ACL09

Distant supervision paradigm

• Like supervised classification:
  • Uses a classifier with lots of features
  • Supervised by detailed hand-created knowledge
  • Doesn’t require iteratively expanding patterns
• Like unsupervised classification:
  • Uses very large amounts of unlabeled data
  • Not sensitive to genre issues in training corpus

Distantly supervised learning of relation extraction patterns

1. For each relation
   Born-In
2. For each tuple in big database
   <Edwin Hubble, Marshfield>
   <Albert Einstein, Ulm>
3. Find sentences in large corpus with both entities
   Hubble was born in Marshfield
   Einstein, born (1879), Ulm
   Hubble’s birthplace in Marshfield
4. Extract frequent features (parse, words, etc)
   PER was born in LOC
   PER, born (XXXX), LOC
   PER’s birthplace in LOC
5. Train supervised classifier using thousands of patterns
   P(born-in | f1, f2, f3, …, f70000)
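The first four steps of this procedure can be sketched on the slide's Born-In example; the final step, training a classifier on the collected features, is left out. The sketch assumes an entity is mentioned by the last word of its name, uses the whole masked sentence as a single lexical feature, and masks digits as X. The function name, the `arg_types` table, and the data layout are assumptions for illustration.

```python
import re
from collections import Counter


def distant_features(kb, corpus, arg_types):
    """Count (relation, feature) pairs harvested via a database of known tuples."""
    counts = Counter()
    for relation, pairs in kb.items():                  # 1. for each relation
        type1, type2 = arg_types[relation]
        for entity1, entity2 in pairs:                  # 2. for each tuple
            # Assumption: entities are mentioned by the last word of their name.
            m1, m2 = entity1.split()[-1], entity2.split()[-1]
            for sent in corpus:                         # 3. sentences with both
                if m1 in sent and m2 in sent:
                    # 4. feature: the sentence with entities and digits masked
                    feature = sent.replace(m1, type1).replace(m2, type2)
                    feature = re.sub(r"\d", "X", feature)
                    counts[(relation, feature)] += 1
    return counts


kb = {"Born-In": [("Edwin Hubble", "Marshfield"), ("Albert Einstein", "Ulm")]}
arg_types = {"Born-In": ("PER", "LOC")}
corpus = [
    "Hubble was born in Marshfield",
    "Einstein, born (1879), Ulm",
    "Hubble's birthplace in Marshfield",
]
for (relation, feature), n in distant_features(kb, corpus, arg_types).items():
    print(relation, "|", feature, "|", n)
```

Every sentence that happens to contain both entities is treated as a positive example, which is the source of noise in distantly supervised training data.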

Unsupervised relation extraction

• Open Information Extraction:
  • extract relations from the web with no training data, no list of relations

1. Use parsed data to train a “trustworthy tuple” classifier
2. Single-pass extract all relations between NPs, keep if trustworthy
3. Assessor ranks relations based on text redundancy

(FCI, specializes in, software development)
(Tesla, invented, coil transformer)

M. Banko, M. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. 2007. Open information extraction from the web. IJCAI

Evaluation of Semi-supervised and Unsupervised Relation Extraction

• Since it extracts totally new relations from the web
  • There is no gold set of correct instances of relations!
  • Can’t compute precision (don’t know which ones are correct)
  • Can’t compute recall (don’t know which ones were missed)
• Instead, we can approximate precision (only)
  • Draw a random sample of relations from output, check precision manually

$$\hat{P} = \frac{\text{\# of correctly extracted relations in the sample}}{\text{Total \# of extracted relations in the sample}}$$

• Can also compute precision at different levels of recall.
  • Precision for top 1000 new relations, top 10,000 new relations, top 100,000
  • In each case taking a random sample of that set
• But no way to evaluate recall
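The sampled precision estimate can be sketched as below. In practice the correctness check is a human judgment; here `is_correct` is a stand-in callable, and the function names and the fixed random seed are assumptions made so the example is reproducible.

```python
import random


def estimate_precision(extractions, is_correct, sample_size, seed=0):
    """Estimate precision from a random sample of the extractor's output."""
    if not extractions:
        raise ValueError("no extractions to sample from")
    rng = random.Random(seed)
    sample = rng.sample(extractions, min(sample_size, len(extractions)))
    return sum(1 for item in sample if is_correct(item)) / len(sample)


def precision_at_k(ranked, is_correct, k, sample_size, seed=0):
    """Same estimate, restricted to the top-k extractions by confidence."""
    return estimate_precision(ranked[:k], is_correct, sample_size, seed)


# Toy output: items 0..9 ranked by confidence; pretend a judge accepts items < 7.
ranked = list(range(10))
judge = lambda item: item < 7
print(estimate_precision(ranked, judge, sample_size=10))
print(precision_at_k(ranked, judge, k=5, sample_size=5))
```

Sampling within the top 1000, top 10,000, and so on gives a precision figure at each cutoff, but none of these numbers says anything about recall.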
