word sense determination from wikipedia data using neural ... · in proceedings of the joint...

Word Sense Determination from Wikipedia Data Using Neural Networks

Advisor: Dr. Chris Pollett

Committee Members: Dr. Jon Pearce, Dr. Suneuy Kim

By Qiao Liu

Agenda

• Introduction
• Background
• Model Architecture
• Data Sets and Data Preprocessing
• Implementation
• Experiments and Discussions
• Conclusion and Future Work

Introduction

• Word sense disambiguation is the task of identifying which sense of an ambiguous word is used in a sentence.

In 1890, he became custodian of the Milwaukee Public Museum, where he collected plant specimens for their greenhouse.

… send collected fluid to a municipal sewage treatment plant or a commercial wastewater treatment facility.

• Word sense disambiguation is useful in natural language processing tasks such as speech synthesis, question answering, and machine translation.

Introduction

[Diagram: word sense disambiguation splits into the lexical sample task and the all-words task, each with the subtasks sense discrimination and sense labeling; a "Project purpose" marker highlights the scope of this project.]

• Two variants of the word sense disambiguation task:
  lexical sample task
  all-words task

• Two subtasks:
  sense discrimination
  sense labeling

Background

Existing Work

Background

Approach 1: Dictionary-based

Given a target word t to be disambiguated in context c:
1. Retrieve all the sense definitions for t from a dictionary.
2. Select the sense whose definition has the most overlap with the context c of t.

• This approach requires a hand-built, machine-readable semantic sense dictionary.
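The two steps above can be sketched as a simplified overlap count. This is only an illustration: the two-sense dictionary for "plant", the stopword list, and the function name `simplified_lesk` are hypothetical and are not taken from the project.

```python
def simplified_lesk(context_words, sense_definitions, stopwords=frozenset()):
    """Return the sense whose definition shares the most words with the context."""
    context = {w.lower() for w in context_words} - stopwords
    best_sense, best_overlap = None, -1
    for sense, definition in sense_definitions.items():
        gloss = {w.lower() for w in definition.split()} - stopwords
        overlap = len(context & gloss)  # number of shared content words
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


# Hypothetical mini dictionary with two senses of "plant".
PLANT_SENSES = {
    "organism": "a living organism such as a tree or flower that grows in soil",
    "factory": "a facility or building where an industrial treatment "
               "or manufacturing process takes place",
}
STOPWORDS = frozenset(
    {"a", "an", "the", "or", "of", "to", "in", "such", "as", "that", "where"}
)

context = "send collected fluid to a municipal sewage treatment plant".split()
chosen = simplified_lesk(context, PLANT_SENSES, STOPWORDS)
```

In this toy case the only shared content word is "treatment", so the industrial sense is selected. A realistic dictionary-based system depends heavily on the coverage and wording of its sense definitions.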

Background

Approach 2: Supervised machine learning

1. Extract a set of features from the context of the target word.
2. Use the features to train classifiers that can label ambiguous words in new text.

• This approach requires large, costly hand-built resources, because each ambiguous word needs to be labeled in the training data.

• A semi-supervised approach was proposed by Yarowsky in 1995. It does not rely on large hand-built data, because it uses bootstrapping to generate a dictionary from a small hand-labeled seed set.

Background

Approach 3: Unsupervised machine learning

Interpret the senses of an ambiguous word as clusters of similar contexts. Contexts and words are represented by high-dimensional, real-valued vectors built from co-occurrence counts.

• In our project, we use a modification of this approach:
  • Word embeddings are trained using Wikipedia pages.
  • Word vectors of contexts computed from these embeddings are then clustered.
  • Given a new word to disambiguate, we use its context and the word embedding to find a word vector corresponding to this context. Then we determine the cluster it belongs to.

• In related work, Schütze used a data set taken from the New York Times News Service and did clustering, but with a different kind of word vector.
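The last step, mapping a new context to a cluster, might look like the sketch below. The slides do not say how the context vector is computed, so the unweighted average used here is an assumption, and the 2-dimensional embeddings and centroids are made-up toy values.

```python
def context_vector(context_words, embeddings):
    """Average the embeddings of the context words that have one (assumed scheme)."""
    vectors = [embeddings[w] for w in context_words if w in embeddings]
    if not vectors:
        raise ValueError("no context word has an embedding")
    return [sum(coords) / len(vectors) for coords in zip(*vectors)]


def nearest_cluster(vector, centroids):
    """Index of the centroid closest to the vector in squared Euclidean distance."""
    def squared_distance(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(centroids)),
               key=lambda c: squared_distance(vector, centroids[c]))


# Toy values for illustration only.
embeddings = {
    "greenhouse": [1.0, 0.0],
    "specimens": [0.8, 0.2],
    "sewage": [0.0, 1.0],
    "treatment": [0.2, 0.8],
}
centroids = [[0.9, 0.1], [0.1, 0.9]]  # cluster 0 and cluster 1

cluster_a = nearest_cluster(
    context_vector(["specimens", "greenhouse"], embeddings), centroids)
cluster_b = nearest_cluster(
    context_vector(["sewage", "treatment", "the"], embeddings), centroids)
```

Words without an embedding (here "the") are skipped, so the context vector is built only from known words.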

Background

• Word embeddings

A word embedding is a parameterized function mapping words in some language to high-dimensional vectors (perhaps 200 to 500 dimensions).

W: words → R^n
W(“plant”) = [0.3, -0.2, 0.7, …]
W(“crane”) = [0.5, 0.4, -0.6, …]
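In code, such a function W is usually stored as a lookup table from word to vector. The sketch below truncates the two example vectors to three dimensions and adds a made-up vector for "tree"; it then compares words by cosine similarity, a common (assumed, not stated in the slides) way to find a word's nearest words.

```python
import math

# Toy 3-dimensional lookup table. "plant" and "crane" reuse the first three
# components shown on the slide; "tree" is a hypothetical illustration value.
W = {
    "plant": [0.3, -0.2, 0.7],
    "crane": [0.5, 0.4, -0.6],
    "tree": [0.2, -0.1, 0.8],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sim_plant_tree = cosine_similarity(W["plant"], W["tree"])
sim_plant_crane = cosine_similarity(W["plant"], W["crane"])
```

With these toy values, "tree" is closer to "plant" than "crane" is.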

Model Architecture

• Many NLP tasks take the approach of first learning a good word representation on one task and then using that representation for other tasks. We used this approach for the word sense determination task.

Model Architecture

• Learn a good word representation on one task, then use that representation for other tasks.

• We used the Skip-gram model as the neural network language model layer.

Model Architecture

Skip-gram Model Architecture
• The training objective was to learn word embeddings that are good at predicting the context words in a sentence.
• We trained the neural network by feeding it word pairs of target word and context word found in our training data set.

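J'(\theta) = \prod_{t=1}^{V} \prod_{-m \le j \le m,\ j \ne 0} p(w_{t+j} \mid w_t; \theta)

J(\theta) = -\frac{1}{V} \sum_{t=1}^{V} \sum_{-m \le j \le m,\ j \ne 0} \log p(w_{t+j} \mid w_t; \theta)

p(w_c \mid w_t) = \frac{\exp(w_c^{\top} w_t)}{\sum_{j=1}^{V} \exp(w_j^{\top} w_t)}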
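As a rough numerical illustration of the softmax probability and the per-pair negative log-likelihood, here is a plain-Python sketch. It evaluates the full softmax over a tiny made-up vocabulary. The NUM_SAMPLE parameter and the tf.nn.nce_loss reference suggest that the actual training used a sampled approximation, which this sketch does not reproduce.

```python
import math


def softmax_context_probability(context_vectors, target_vector):
    """p(w_c | w_t) for every candidate context word c (full softmax)."""
    scores = [sum(a * b for a, b in zip(vec, target_vector))
              for vec in context_vectors]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pair_loss(context_vectors, target_vector, context_index):
    """Negative log-likelihood of one (target word, context word) pair."""
    probs = softmax_context_probability(context_vectors, target_vector)
    return -math.log(probs[context_index])


# Made-up 2-dimensional vectors for a three-word vocabulary.
context_vectors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
target_vector = [2.0, 0.0]
probs = softmax_context_probability(context_vectors, target_vector)
```

The cost of the denominator grows with the vocabulary size, which is the usual motivation for sampled losses on large vocabularies.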

• k-means clustering

k-means is a simple unsupervised clustering algorithm. The aim of the k-means algorithm is to divide m points in n dimensions into k clusters so that the within-cluster sum of squares is minimized.

The distributional hypothesis says that similar words appear in similar contexts [9, 10]. Thus, we can use k-means to divide all context vectors into k clusters.
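To make the objective concrete, the sketch below runs the basic assign-then-update iteration (Lloyd's algorithm) on toy 2-dimensional points. It is not the Hartigan-Wong variant cited in the references and not the sklearn.cluster implementation used in the project. The initial centroids are passed in explicitly to keep the example deterministic; in practice they are typically chosen at random.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def k_means(points, initial_centroids, iterations=10):
    """Plain Lloyd iteration: assign points to the nearest centroid, then re-average."""
    centroids = [list(c) for c in initial_centroids]
    k = len(centroids)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(coords) / len(members)
                                for coords in zip(*members)]
    return labels, centroids


def within_cluster_sum_of_squares(points, labels, centroids):
    """The quantity k-means tries to minimize."""
    return sum(squared_distance(p, centroids[label])
               for p, label in zip(points, labels))


# Toy stand-ins for context vectors: two well-separated groups.
points = [[0.0, 0.1], [0.1, 0.0], [0.0, 0.0],
          [5.0, 5.1], [5.1, 5.0], [5.0, 5.0]]
labels, centroids = k_means(points, initial_centroids=[points[0], points[3]])
wcss = within_cluster_sum_of_squares(points, labels, centroids)
```

Each resulting cluster is then interpreted as one sense of the ambiguous word.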

Model Architecture

• Data source
https://dumps.wikimedia.org/enwiki/20170201/
The pages-articles.xml file of the Wikipedia data dump contains the current version of all article pages, templates, and other pages.

• Training data for the model
Word pairs: (target word, context word)

Data Sets and Data Preprocessing

Sentence (target word in brackets) | Training samples (window size = 2)
[natural] language processing projects are fun | (natural, language), (natural, processing)
natural [language] processing projects are fun | (language, natural), (language, processing), (language, projects)
natural language [processing] projects are fun | (processing, natural), (processing, language), (processing, projects), (processing, are)
natural language processing [projects] are fun | (projects, language), (projects, processing), (projects, are), (projects, fun)
natural language processing projects [are] fun | (are, processing), (are, projects), (are, fun)
natural language processing projects are [fun] | (fun, projects), (fun, are)
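The sliding-window pairing can be sketched in a few lines. The function name `skipgram_pairs` is illustrative. This version emits every context word inside the window, whereas the NUM_SKIPS parameter indicates that the project sampled a subset of them at random.

```python
def skipgram_pairs(tokens, window_size=2):
    """Return (target word, context word) pairs for every position in the sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


sentence = "natural language processing projects are fun".split()
pairs = skipgram_pairs(sentence, window_size=2)
```

Words near the sentence boundaries produce fewer pairs because part of their window falls outside the sentence.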

Data Set and Data Preprocessing

Steps to process the data:
• Extracted 90M sentences

• Counted words, created a dictionary and a reversed dictionary

• Regenerated sentences

• Created 5B word pairs
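A minimal sketch of the counting and dictionary steps follows. It assumes that the dictionary maps each word to an integer ID, that the reversed dictionary maps IDs back to words, and that regenerating sentences means rewriting them as ID sequences with rare words replaced by an "UNK" token. These details are assumptions, not confirmed by the slides.

```python
from collections import Counter


def build_vocabulary(sentences, vocabulary_size):
    """Map the most frequent words to IDs; ID 0 is reserved for unknown words."""
    counts = Counter(word for sentence in sentences for word in sentence)
    dictionary = {"UNK": 0}
    for word, _ in counts.most_common(vocabulary_size - 1):
        dictionary[word] = len(dictionary)
    reverse_dictionary = {index: word for word, index in dictionary.items()}
    return dictionary, reverse_dictionary


def encode(sentence, dictionary):
    """Rewrite a tokenized sentence as a list of word IDs."""
    return [dictionary.get(word, dictionary["UNK"]) for word in sentence]


sentences = [["the", "plant", "grows"], ["the", "plant"], ["the", "crane"]]
dictionary, reverse_dictionary = build_vocabulary(sentences, vocabulary_size=3)
encoded = encode(["the", "crane", "plant"], dictionary)
```

Capping the vocabulary keeps the embedding matrix at a fixed size regardless of how many distinct words the corpus contains.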

Implementation

The optimizer:
• Gradient descent finds a minimum of a function by taking steps proportional to the negative of the gradient. In each iteration of gradient descent, we need to compute the gradient over all training examples.

• Instead of computing the gradient over the whole training set, each iteration of stochastic gradient descent only estimates this gradient based on a batch of randomly picked examples.

We used stochastic gradient descent to optimize the vector representation during training.
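The difference between the two optimizers is easiest to see on a tiny problem. The sketch below fits a single slope parameter with mini-batch stochastic gradient descent. The toy data, the learning rate, and the function name are illustrative and unrelated to the project's TensorFlow training loop.

```python
import random


def sgd_fit_slope(xs, ys, learning_rate=0.05, batch_size=2, num_steps=500, seed=0):
    """Fit y ~ w * x by minimizing mean squared error with mini-batch SGD."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(num_steps):
        # Estimate the gradient from a random batch instead of the full data set.
        batch = rng.sample(indices, batch_size)
        grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / batch_size
        w -= learning_rate * grad  # step against the gradient
    return w


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by y = 2x
w = sgd_fit_slope(xs, ys)
```

Each step touches only `batch_size` examples, so the cost per step does not grow with the size of the training set.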

Implementation

The parameters:

Parameter | Meaning
VOC_SIZE | The vocabulary size.
SKIP_WINDOW | The window size of text words around the target word.
NUM_SKIPS | The number of context words that are randomly chosen to generate word pairs.
EMBEDDING_SIZE | The number of parameters in the word embedding, i.e., the size of the word vector.
LR | The learning rate of gradient descent.
BATCH_SIZE | The size of each batch in stochastic gradient descent. Running one batch is one step.
NUM_STEPS | The number of training steps.
NUM_SAMPLE | The number of negative samples.

Implementation

Tools and packages:

• TensorFlow r1.4
• TensorBoard 0.1.6
• Python 2.7.10
• WikipediaExtractor v2.55
• sklearn.cluster [15]
• numpy

Experiments and Discussions

The experimental results are compared with Schütze’s 1998 unsupervised learning approach:
• Schütze used a data set (435 MB) taken from the New York Times News Service. We used a data set extracted from Wikipedia pages (12 GB).

• Schütze used co-occurrence counts to generate vectors, which had a large number of dimensions (1,000/2,000). We used the Skip-gram model to learn a distributed word representation with 250 dimensions.

• Schütze applied singular value decomposition because of the large number of vector dimensions. With a smaller number of dimensions, we did not need to perform matrix decomposition.

• We experimented with the Skip-gram model using different parameters and selected one word embedding for clustering.

• Skip-gram model parameters


Experiment with the Skip-gram model
• Used “average loss” to estimate the loss over every 100K batches.
• Visualized the nearest words of some words.


Experiment with classifying word senses
• Clustered the contexts of the occurrences of a given ambiguous word into two or three coherent groups.
• Manually assigned labels to the occurrences of ambiguous words in the test corpus, and compared them with the machine-learned labels to calculate accuracy.
• Before word sense determination, we assigned all occurrences to the most frequent meaning and used the resulting fraction as the baseline.

\text{accuracy} = \frac{\text{number of instances with correct machine-learned sense label}}{\text{total number of test instances}}
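Both the accuracy and the most-frequent-sense baseline are simple ratios. The sketch below assumes that cluster IDs have already been mapped to sense names; the labels shown are made up for illustration.

```python
from collections import Counter


def accuracy(predicted_labels, gold_labels):
    """Fraction of test instances whose machine-learned label matches the manual label."""
    correct = sum(p == g for p, g in zip(predicted_labels, gold_labels))
    return correct / len(gold_labels)


def most_frequent_sense_baseline(gold_labels):
    """Accuracy obtained by assigning every occurrence to the most frequent sense."""
    most_common_count = Counter(gold_labels).most_common(1)[0][1]
    return most_common_count / len(gold_labels)


# Hypothetical manual (gold) and machine-learned labels for four occurrences.
gold = ["factory", "organism", "organism", "organism"]
predicted = ["factory", "organism", "factory", "organism"]

model_accuracy = accuracy(predicted, gold)
baseline = most_frequent_sense_baseline(gold)
```

A model is only useful for a word when its accuracy exceeds this baseline, which becomes harder as one sense dominates the occurrences.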


• The “Schütze’s baseline” column gives the fraction of the most frequent sense in his data sets.

• The “Schütze’s accuracy” column gives the results of his disambiguation experiments with local term frequency, if applicable.

• We got better accuracy in the experiments with “capital” and “plant”.

• However, the model cannot determine the senses of the words “interest” and “sake”, which have a baseline over 85% in our data sets.


Discussions
• Our data sets (12 GB) are much larger than Schütze’s data sets (435 MB). For example, the size of his training set for the word “capital” is 13,015, and ours is 179,793. The larger data sets might have helped to increase the accuracy for some words.

• We also observed that when the baseline is high (≥ 85%), the model cannot determine the senses of the word. The performance of unsupervised learning relies on sufficient information in the training data, but for these words the model was not trained with enough data carrying the less frequent meanings.

• The size of the training data and the distribution of the senses of the target word have a significant influence on the performance of the model.


Conclusion

• In this project, we used a distributed word representation and the distributional hypothesis to build a modular model that classifies the senses of ambiguous words.

• Our experiments showed that the model performed well when each sense of an ambiguous word accounted for more than 20% of its occurrences in the training data set.

Conclusion and Future Work

Future Work
• Optimize the classifier. One possible approach is to use a weighted sum of contexts that takes IDF into account.
• Extend this approach to other models with different classifiers and experiment with them. A classifier that works well when occurrences are skewed toward one class might improve the accuracy for words whose occurrences mostly use the most frequent sense.

• Tokenize the corpus; this could reduce the training time by reducing the vocabulary size.

Conclusion and Future Work

• Y. Bengio, R. Ducharme, P. Vincent. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137-1155, 2003.

• Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. ICLR Workshop, 2013.

• G. E. Hinton, J. L. McClelland, D. E. Rumelhart. Distributed representations. In: Parallel distributed processing: Explorations in the microstructure of cognition. Volume 1: Foundations, MIT Press, 1986.

• T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean. Large language models in machine translation. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Language Learning, 2007.

• David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

• H. Schwenk. Continuous space language models. Computer Speech and Language, vol. 21, 2007.

• T. Mikolov, A. Deoras, S. Kombrink, L. Burget, J. Černocký. Empirical Evaluation and Combination of Advanced Language Modeling Techniques. In: Proceedings of Interspeech, 2011.

References

• Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 2013a.

• James R. Curran and Marc Moens. Improvements in automatic thesaurus extraction. In Proceedings of the ACL-02 Workshop on Unsupervised Lexical Acquisition, pages 59–66, 2002.

• Patrick Pantel and Dekang Lin. Discovering word senses from text. In Proc. of SIGKDD-02, pages 613–619, New York, NY, USA. ACM, 2002.

• Michael Lesk. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of SIGDOC, pages 24-26, 1986.

• Olah, Christopher. Deep Learning, NLP, and Representations. Retrieved from http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/. 2014.

• Hartigan, J. A. and Wong, M. A. Algorithm AS 136: A K-Means Clustering Algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1): pages 100–108, 1979.

• Schütze, Hinrich. Dimensions of meaning. In Proceedings of Supercomputing ’92, pages 787-796, 1992.

References

• Pedregosa et al. Scikit-learn: Machine Learning in Python. JMLR 12, pp. 2825-2830, 2011.

• Michael U. Gutmann and Aapo Hyvärinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. The Journal of Machine Learning Research, 13:307–361, 2012.

• Bottou, L. Large-Scale Machine Learning with Stochastic Gradient Descent. In: Lechevallier Y., Saporta G. (eds) Proceedings of COMPSTAT'2010. Physica-Verlag HD, 2010.

• TensorFlow Tutorial, tf.nn.nce_loss. Retrieved from https://www.tensorflow.org/api_docs/python/tf/nn/nce_loss. 2017.

• McCormick, C. Word2Vec Tutorial Part 2 - Negative Sampling. Retrieved from http://www.mccormickml.com. 2017, January 11.

• D. Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. Proc. 33rd Annual Meeting of the ACL, Cambridge, MA, USA, pp. 189-196, 1995.

• Schütze, Hinrich. Automatic word sense discrimination. Computational Linguistics, v. 24 n. 1, March 1998.

References

Questions

Thank You!

Appendix: Model Architecture

Skip-gram model architecture
• We trained the neural network by feeding it word pairs of target word and context word found in our training data set.
