Named Entity Recognition and Linking
2018/01/24
Artūrs Znotiņš, [email protected]
Supervisor: Guntis Bārzdiņš, Dr.sc.comp.
The Problem
• Large amounts of unstructured data: natural language (WWW, books, newspapers, radio)
• A lot of ambiguity: context is very important
• Humans are good at semantic disambiguation:
  • What entities does the text refer to?
  • What facts does the text describe?
  • What is the meaning of a specific word?
• How to do this automatically?
Motivation
• Utilize existing unstructured natural language resources: hidden knowledge
• Use cases:
  • Information extraction (IE)
  • Text clustering
  • Media monitoring
  • Dialog systems
  • Question answering
  • Machine translation
Example query: "Cikos atiet vilciens uz Daugavpili?" ("What time does the train to Daugavpils leave?")
Definitions
• Entity: something that exists by itself
• Named entity (NE): an entity with a specific name
• Mention: a phrase that refers to an entity
• Knowledge base (KB): an organized repository of knowledge consisting of concepts, properties, links
• (Named) Entity Linking (NEL): linking mentions of entities within a text to KB entities
• Word Sense Disambiguation (WSD): assigning meanings to word occurrences within a text
Knowledge Base: Example
https://www.wikidata.org/wiki/Wikidata:Introduction
Knowledge Base: Example
https://www.wikidata.org/w/index.php?search=&search=andris+berzins
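Named Entity Linking: Example
“Arī otra figūra Daimler lietā ir Bojāra ārštata padomnieks, un sens eksmēra draugs no armijas laikiem – Armands Zeihmanis.” (from tvnet.lv)
English: "The other figure in the Daimler case is also Bojārs's freelance adviser and an old friend of the ex-mayor from their army days: Armands Zeihmanis."
• Bojāra → Gundars Bojārs, born 1967, https://lv.wikipedia.org/wiki/Gundars_Bojārs
• Otra figūra = Bojāra = eksmēra
• Bojārs ↔ Zeihmanis: friend (draugs), adviser (padomnieks)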
NEL: General Approach
• Named Entity Recognition
• Candidate Selection: selecting possible candidates from the target Knowledge Base
• Disambiguation: deciding which candidate is the correct identity corresponding to the mention of a Named Entity
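The three stages of the general approach can be sketched as a small pipeline. This is only an illustration under stated assumptions, not the system described in these slides: the toy knowledge base `TOY_KB` and its entity ids, the capitalisation-based `recognize` function, and the prior-count rule in `disambiguate` are all invented for the example.

```python
from dataclasses import dataclass

# Hypothetical toy knowledge base: surface form -> [(entity id, prior count)].
# The ids and counts are made up for illustration.
TOY_KB = {
    "Riga": [("kb:riga", 100)],
    "Andris Berzins": [("kb:andris_berzins_1", 60), ("kb:andris_berzins_2", 40)],
}


@dataclass
class Mention:
    text: str
    start: int  # index of the first token
    end: int    # index one past the last token


def recognize(tokens):
    """Stage 1 (toy NER): group consecutive capitalised tokens into a mention."""
    mentions, i = [], 0
    while i < len(tokens):
        if tokens[i][:1].isupper():
            j = i
            while j < len(tokens) and tokens[j][:1].isupper():
                j += 1
            mentions.append(Mention(" ".join(tokens[i:j]), i, j))
            i = j
        else:
            i += 1
    return mentions


def select_candidates(mention, kb):
    """Stage 2: look up possible KB entities for the mention's surface form."""
    return kb.get(mention.text, [])


def disambiguate(candidates):
    """Stage 3 (toy rule): pick the candidate with the highest prior count."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[1])[0]


def link(text, kb=TOY_KB):
    """Run the three stages and return (mention text, entity id or None) pairs."""
    tokens = text.split()
    return [(m.text, disambiguate(select_candidates(m, kb)))
            for m in recognize(tokens)]
```

For instance, `link("yesterday Andris Berzins visited Riga")` yields one pair for "Andris Berzins" and one for "Riga". A real system would replace each toy stage with a trained component, and the disambiguation step would use context rather than a fixed prior.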
NEL: Context Free Approach
• Extract surface forms from KB or annotated corpus
  • DBpedia labels (rather sparse)
  • Internal links of Wikipedia
• Clean and catalog
• Fast string match
+ Simple
+ High precision
- Low recall
- Does not solve ambiguities
$$U_{s} = \{\, e_i \in E_{s} : C(e_i) \ge \alpha \times \sum_{j} C(e_j) \,\}$$
Generate name alternatives
Decide on which surface forms have ambiguous labels which cannot be considered without context
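A context-free linker of this kind can be sketched as a surface-form catalog plus a threshold test. The sketch below is one possible reading of the threshold rule on this slide, not the author's implementation: it assumes that C(e) is the number of times a surface form links to entity e (for example in Wikipedia internal links), and the function names, the `alpha` default, and the `max_len` window are hypothetical.

```python
from collections import Counter, defaultdict


def build_catalog(anchor_pairs):
    """Count how often each (lower-cased) surface form links to each entity."""
    catalog = defaultdict(Counter)
    for surface, entity in anchor_pairs:
        catalog[surface.lower()][entity] += 1
    return catalog


def dominant_entities(counts, alpha):
    """Entities e with C(e) >= alpha * (sum of C over all candidates)."""
    total = sum(counts.values())
    return {e for e, c in counts.items() if c >= alpha * total}


def link_context_free(tokens, catalog, alpha=0.9, max_len=3):
    """Greedy longest-match lookup over token n-grams.

    A surface form is linked only when exactly one entity passes the
    threshold; ambiguous surface forms are skipped, since they cannot be
    resolved without context.
    """
    links, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            surface = " ".join(tokens[i:i + n]).lower()
            kept = dominant_entities(catalog.get(surface, {}), alpha)
            if len(kept) == 1:
                links.append((surface, next(iter(kept))))
                i += n
                break
        else:
            i += 1  # no n-gram starting here could be linked
    return links
```

Skipping ambiguous surface forms is what gives this approach high precision and low recall: a name shared by two equally frequent entities is simply left unlinked.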
NEL: Knowledge Rich Approach
• String match based candidate selection
• Mention attributes:
  • Bag-of-name-words
  • Bag-of-mention-words
  • Bag-of-context-words
• Cosine similarity between vectors
[Figure: compatibility function between the levels Entity (KB), Subentity (document level), and Mentions]
+ Can solve ambiguities
+ KB can be enriched with validated entities
- Performance depends on pre-processing components
Wick et al., 2013; Stoyanov et al., 2012
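The bag-of-words comparison can be illustrated with a few lines of code. This is a minimal sketch assuming whitespace tokenisation and raw term counts; the candidate descriptions and entity ids in the usage note are invented, and the compatibility function used in the cited work is richer than a single cosine score.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lower-cased whitespace tokens with their counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank_candidates(context, candidates):
    """Rank candidate entities by similarity of their description to the context.

    candidates: dict mapping entity id -> description text (hypothetical KB data).
    Returns (score, entity id) pairs, best first.
    """
    ctx = bag_of_words(context)
    scored = [(cosine(ctx, bag_of_words(desc)), eid)
              for eid, desc in candidates.items()]
    return sorted(scored, reverse=True)
```

As a usage example, ranking `{"kb:x": "president of latvia", "kb:y": "bank chairman"}` against the context "the president of latvia gave a speech" puts `kb:x` first, because only its description shares words with the context.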
NEL: Processing Pipeline
[Figure: pipeline diagram with the components Text document, Morphology and syntax, NER, Co-reference resolution, Entity Linking, Knowledge base, Annotated document]
Morphology and Syntax
• Lexicon based morphological analyzer
• CMM morphological tagger
• LSTM transition-based parser
• Word embeddings
• Character LSTM representation

Model                     UD (UAS, %)
LSTM + WE                 75.1
LSTM + WE + morpho-tag    75.4

• Inflection generation
• Mention candidate selection
Paikens et al., 2013; Pretkalniņa et al., 2016
Named Entity Recognition

Model            NER (F1, %)
Baseline CRF     83.4
LSTM-CRF         63.7
LSTM-CRF + WE    90.5
NER: Learning Curve
[Figure: F1-score (axis 50% to 100%) against the corpus fraction used for training (0% to 100%) for CRF, LSTM-CRF, and LSTM-CRF + embeddings; data labels in extraction order: 58.8%, 66.0%, 57.1%, 53.9%, 80.1%, 62.6%, 57.1%, 84.5%, 72.1%, 60.0%, 86.6%, 83.4%, 63.7%, 90.2%]
Named Entity Recognition
Bi-directional LSTM-CRF
• Pre-trained word embeddings
• Character LSTM or CNN representations
• Joint cross-label modelling
[Figure: network layers: CRF layer, LSTM, Word embeddings, Character level word vector]
Lample et al., 2015
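Joint cross-label modelling means that the CRF layer scores whole label sequences rather than each token independently. A minimal sketch of the decoding step (Viterbi search) is shown below; it assumes the per-token emission scores have already been produced by the LSTM, and all labels and numbers in the usage note are invented for illustration.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions:   list (one dict per token) mapping label -> score
    transitions: dict mapping (previous label, current label) -> score;
                 missing pairs are treated as 0.0
    """
    best = [{lab: emissions[0][lab] for lab in labels}]
    back = []  # back[t][cur] = best previous label for token t + 1
    for em in emissions[1:]:
        scores, pointers = {}, {}
        for cur in labels:
            prev = max(labels,
                       key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0))
            scores[cur] = (best[-1][prev]
                           + transitions.get((prev, cur), 0.0)
                           + em[cur])
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

With labels `O`, `B-PER`, `I-PER` and a strongly negative transition score for `("O", "I-PER")`, the decoder avoids the invalid sequence `O I-PER` even when `O` has the highest emission score on the first token, which a per-token argmax would not do.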
Co-reference Resolution
• Mentions that refer to the same real-world entity
Following in the footsteps of their father Dainis, an ex-bobsleigh specialist, brothers Martins and Tomass Dukurs are currently ranked among the globe’s top skeleton racers. Martins, world number one, is the gold medal favourite. He dreams of sharing the podium with his elder sibling.
• {their father Dainis, an ex-bobsleigh specialist}
• {their, brothers Martins and Tomass Dukurs}
• {Martins, Martins, world number one, the gold medal favourite, He, his}
• {Tomass Dukurs, his elder sibling}
Co-reference Resolution
• Mention detection: NE, pronouns, noun phrases, etc.
• Mention chaining (clustering)
  • Grammatically related mentions
  • Salience (gender, number, person, etc.)
  • Pairwise or cluster based
• Representative name and type (person, etc.)
LVCoref: Rule Based System
• Exact string match
• Precise grammatical constructions
  • Appositives (Andris Bērziņš, the president of Latvia)
  • Nominal predicatives (Jānis Bērziņš is a professor)
  • Acronyms
• Head match variations (Supreme Court of the Republic of Latvia ↔ Supreme Court of Latvia)
• Pronoun anaphora
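The rule list above can be read as a sequence of matching rules applied from the most precise to the least precise, each one merging mention clusters. The sketch below illustrates that idea only; it is not LVCoref itself. It works on plain English strings, omits the rules that need syntax (appositives, predicatives, pronouns), and its head heuristic (the last word before the first "of") is a crude assumption made for the example.

```python
STOP_WORDS = {"of", "the"}


def exact_match(a, b):
    return a.lower() == b.lower()


def acronym_match(a, b):
    """True if one mention is the upper-case initials of the other."""
    def initials(m):
        return "".join(w[0] for w in m.split() if w[0].isupper())
    return (a.isupper() and a == initials(b)) or (b.isupper() and b == initials(a))


def head(mention):
    """Crude head word: last token before the first 'of', else the last token."""
    tokens = mention.lower().split()
    if "of" in tokens[1:]:
        tokens = tokens[:tokens.index("of", 1)]
    return tokens[-1]


def head_match(a, b):
    """Same head, and the content words of one mention contain the other's."""
    wa = {w for w in a.lower().split() if w not in STOP_WORDS}
    wb = {w for w in b.lower().split() if w not in STOP_WORDS}
    return head(a) == head(b) and (wa <= wb or wb <= wa)


def cluster(mentions, rules):
    """Apply the rules in order; each successful match merges two clusters."""
    parent = list(range(len(mentions)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for rule in rules:
        for j in range(len(mentions)):
            for i in range(j):
                if find(i) != find(j) and rule(mentions[i], mentions[j]):
                    parent[find(j)] = find(i)
    groups = {}
    for i, m in enumerate(mentions):
        groups.setdefault(find(i), []).append(m)
    return list(groups.values())
```

Calling `cluster(mentions, [exact_match, acronym_match, head_match])` on a list of non-empty mention strings returns the clusters in order of their first mention. Ordering the rules by precision means that a risky rule such as head match only ever extends clusters that the safer rules have already formed.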
LVCoref: Results

                          F1      P       R
Predicted mentions
  Exact match             40.0    74.2    28.0
  + Precise construction  46.2    74.5    34.3
  + Head match            56.2    69.4    47.3
  + Pronouns              58.0    65.1    52.3
Gold mentions
  Exact match             42.8    86.0    29.2
  + Precise construction  45.4    98.4    29.5
  + Head match            66.8    88.5    54.1
  + Pronouns              76.5    87.0    68.9
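LVCoref: Components
[Figure: bar chart of ΔR, ΔP and ΔF1 in percentage points (axis from -1 to 4) for the components Gender, Number, Syntactical constraints, Entity level constraints, Named entity labels, Syntactical information; data labels in extraction order: 0.3, 0.9, 0.1, 0.7, 1.1, 2.3]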
Data annotation for Latvian
• Selected paragraphs from The Balanced Corpus of Modern Latvian
• NER/NEL
  • 8 categories: person, organization, location, GPE, product, event, time, entity
  • Derived from MUC and AMR guidelines
  • Links to Wikipedia
• Co-references:
  • Named entities, noun phrases and pronouns referring to specific entities
  • Derived from OntoNotes 6.0 guidelines
• Also syntax, frames, AMR
Projects:
• Research on automatic computational-linguistic text analysis for the development of a new information archive product (Teksta automātiskas datorlingvistikas analīzes pētījums jauna informācijas arhīva produkta izstrādē, 2013-2014)
• A multilayer language resource set for semantic analysis and synthesis of text in Latvian (Daudzslāņu valodas resursu kopa teksta semantiskajai analīzei un sintēzei latviešu valodā, 2016-2019)
Future Work
• More annotated data
• Improve mention detection for Latvian
  • High impact on CR
  • Joint learning with shallow syntax parsing
• Named entity recognition
  • Hierarchical named entity mentions
  • Incorporate neural LSTM character level language model
• Co-reference resolution: joint learning with NEL
• Named entity recognition in speech: subword and class based language models
• Cross-lingual named entity linking
Publications
• Miranda, S., Znotins, A., Cohen, S. B., Barzdins, G. (2017). Multilingual Clustering of Streaming News. Submitted to NAACL HLT.
• Znotiņš, A. (2016). Word Embeddings for Latvian Natural Language Processing Tools. In Human Language Technologies: The Baltic Perspective, IOS Press. Web of Science.
• Znotiņš, A., Polis, K., and Darģis, R. (2015). Media monitoring system for Latvian radio and TV broadcasts. In Proceedings of the 16th Annual Conference of the International Speech Communication Association (INTERSPEECH), 732–733. Web of Science, Scopus.
• Darģis, R. and Znotiņš, A. (2014). Baseline for Keyword Spotting in Latvian Broadcast Speech. In Human Language Technologies: The Baltic Perspective, IOS Press, 75–82. Web of Science.
• Znotiņš, A. and Paikens, P. (2014). Coreference Resolution for Latvian. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC), 3209–3213. Web of Science.
• Pretkalniņa, L., Znotiņš, A., Rituma, L., Goško, D. (2014). Dependency parsing representation effects on the accuracy of semantic applications - an example of an inflective language. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC), 4074–4081. Web of Science.
Summary
• NER/NEL: find all entity mentions within a text and link them to an authoritative data source (knowledge base)
• NEL as text anchoring:
  • Coarse summary of what the text is about
  • Other tasks can utilize information from the KB
  • Other tasks can enrich named entities with additional semantic annotation layers