TRANSCRIPT
-
Modelling Sentence Pair Similarity with Multi-Perspective Convolutional Neural Networks
Zhucheng Tu | CS 898 Spring 2017
July 17, 2017
1
-
Outline
Motivation
◦ Why do we want to model sentence similarity?
◦ Challenges
Existing Work on Sentence Modeling
Multi-Perspective CNN
Modifications and Results
Future Work
2
-
Motivation
Modeling the similarity of a pair of sentences is critical to many NLP tasks:
◦ Paraphrase identification, e.g. plagiarism detection or detecting duplicate questions
◦ Question answering, e.g. answer selection
◦ Query ranking
3
-
What makes sentence modelling hard?
◦ Different ways of saying the same thing
◦ Little annotated training data
◦ "Difficult to use sparse, hand-crafted features as in conventional approaches in NLP" (He et al., 2015)
4
-
Existing Work
◦ Before deep learning, methods included:
◦ N-gram overlap on words and characters
◦ Knowledge-based, e.g. using WordNet
◦ …
◦ Combinations of these methods and multi-task learning
◦ Deep learning methods:
◦ Collobert and Weston (2008) trained a CNN in a multi-task setting
◦ Kalchbrenner et al. (2014) used dynamic k-max pooling to handle variable-sized input
◦ Kim (2014) used fixed & learned word vectors and varying window sizes & convolution filters
◦ more CNNs…
◦ Tai et al. (2015) and Zhu et al. (2015) used tree-based LSTMs
5
-
Multi-Perspective CNN
◦ Based on: Hua He, Kevin Gimpel, and Jimmy Lin. 2015. Multi-perspective sentence similarity modeling with convolutional neural networks. In Proceedings of EMNLP, pages 1576–1586.
◦ Compare sentence pairs using a "multiplicity of perspectives"
◦ Two components: sentence model and similarity measurement layer
◦ Advantages:
◦ Does not use syntax parsers
◦ Does not need an unsupervised pre-training step
6
-
Multi-Perspective CNN Architecture
Sentence Model
7
-
Preparing Input
◦ Use GloVe (840B tokens, 2.2M vocab, 300-d vectors) to create sentence embeddings
◦ Use values drawn from Normal(0, 1) for words not found in the vocab
◦ Pad sentence embeddings to create uniformly-sized batches for faster GPU training

A group of kids is playing in a yard and an old man is standing in the background
A group of boys in a yard is playing and a man is standing in the background
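The input-preparation steps above can be sketched as follows. The toy vocabulary, vector size, and function name are illustrative only; the real model uses 300-d GloVe vectors.

```python
import numpy as np

def embed_batch(sentences, glove, dim=4, pad_len=None, seed=0):
    """Map tokenized sentences to a padded batch of word vectors.

    Words missing from the vocabulary get vectors drawn from N(0, 1);
    shorter sentences are zero-padded to a uniform length so the batch
    can be processed efficiently on a GPU.
    """
    rng = np.random.default_rng(seed)
    pad_len = pad_len or max(len(s) for s in sentences)
    batch = np.zeros((len(sentences), pad_len, dim))
    for i, sent in enumerate(sentences):
        for j, word in enumerate(sent):
            if word not in glove:          # OOV word: random init
                glove[word] = rng.normal(0.0, 1.0, size=dim)
            batch[i, j] = glove[word]
    return batch

# Toy 4-d "GloVe" table standing in for the real 300-d one.
glove = {"a": np.ones(4), "man": np.full(4, 2.0)}
batch = embed_batch([["a", "man"], ["a"]], glove)
print(batch.shape)  # (2, 2, 4)
```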
8
-
Sentence Modeling: Multi-Perspective Convolution
Holistic filters | Per-dimensional filters
Two types of convolution for each sentence
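A minimal sketch of the two filter types, following the description in He et al. (2015): a holistic filter spans the full embedding dimension per window, while a per-dimension filter convolves each embedding dimension independently. Shapes and names here are illustrative.

```python
import numpy as np

def holistic_conv(sent, filt):
    """Holistic filter: each window spans all embedding dimensions.
    sent: (length, dim), filt: (ws, dim) -> vector of length-ws+1 values."""
    ws = filt.shape[0]
    return np.array([np.sum(sent[i:i + ws] * filt)
                     for i in range(sent.shape[0] - ws + 1)])

def per_dim_conv(sent, filt):
    """Per-dimension filter: each embedding dimension is convolved
    independently with its own width-ws filter.
    sent: (length, dim), filt: (ws, dim) -> (length-ws+1, dim)."""
    ws = filt.shape[0]
    return np.array([np.sum(sent[i:i + ws] * filt, axis=0)
                     for i in range(sent.shape[0] - ws + 1)])

sent = np.arange(12.0).reshape(4, 3)        # toy sentence: 4 words, 3-d
out_h = holistic_conv(sent, np.ones((2, 3)))
out_p = per_dim_conv(sent, np.ones((2, 3)))
print(out_h.shape, out_p.shape)  # (3,) (3, 3)
```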
9
-
Sentence Modeling: Multiple Pooling
Multiple types of pooling for each type of convolution; we call the group of filters for a particular convolution type a "block"
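The multiple-pooling step can be sketched as below: max, min, and mean pooling are each applied over the length axis of a convolution output (the pooling set follows He et al., 2015; the helper name is illustrative).

```python
import numpy as np

def pool_block(conv_out):
    """Apply max, min, and mean pooling over the length axis of one
    convolution output. conv_out: (out_len, num_filters) ->
    dict of (num_filters,) vectors, one per pooling type."""
    return {"max": conv_out.max(axis=0),
            "min": conv_out.min(axis=0),
            "mean": conv_out.mean(axis=0)}

conv_out = np.array([[1.0, -2.0], [3.0, 0.0], [-1.0, 4.0]])
pools = pool_block(conv_out)
print(pools["max"])  # [3. 4.]
```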
10
-
Sentence Modeling: Multiple Window Sizes
Multiple blocks, each corresponding to a particular width
ws = 1
ws = 2
ws = 3
A special ws = ∞ corresponds to the entire sentence
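A toy sketch of the multiple-window-size idea, with one all-ones holistic filter per width and max pooling; `None` stands in for ws = ∞, where the window covers the whole sentence. Filter weights and names are illustrative, not from the actual model.

```python
import numpy as np

def conv_max(sent, ws):
    """One toy 'block': a single all-ones holistic filter of width ws,
    followed by max pooling. ws=None plays the role of ws = infinity:
    the window spans the entire sentence."""
    n = sent.shape[0]
    ws = n if ws is None else ws
    vals = [np.sum(sent[i:i + ws]) for i in range(n - ws + 1)]
    return max(vals)

sent = np.arange(6.0).reshape(3, 2)  # 3 words, 2-d embeddings
feats = {ws: conv_max(sent, ws) for ws in (1, 2, 3, None)}
print(feats)
```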
11
-
Sentence Modeling: Putting it together
12
-
Sentence Modeling: Putting it together
13
-
Similarity Measurement Layer
◦ We can flatten the outputs from the different blocks into a 1-D vector and compare the results
◦ Problem: different parts of the flattened vector represent different results, so comparing flattened vectors might capture less information
◦ Instead, we can compare over non-flattened local regions
14
-
Local Region Comparisons
Horizontal comparison: comparing local regions of the two sentences based on matching pooling method and window size, for holistic filters only. Compare using cosine distance and Euclidean distance.

Vertical comparison: similar, but in the vertical direction, for both holistic and per-dimension filters. Compare using cosine distance, Euclidean distance, and element-wise absolute value.
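The comparison unit applied to each pair of local regions can be sketched as follows (cosine is shown as a similarity; the element-wise absolute difference is the extra measure used in vertical comparison):

```python
import numpy as np

def compare(a, b):
    """Compare two pooled local-region vectors using the three measures
    described above: cosine, Euclidean distance, and the element-wise
    absolute difference."""
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    euc = np.linalg.norm(a - b)
    abs_diff = np.abs(a - b)
    return cos, euc, abs_diff

cos, euc, diff = compare(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
print(round(cos, 3), round(euc, 3))  # 0.0 1.414
```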
15
-
Other Model Details
◦ Fully-connected layer: after similarity measurement, add two linear layers with a tanh activation layer in between
◦ Final layer is a log-softmax layer
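The classification head described above can be sketched in a few lines; the layer sizes here are toy values, not the model's actual dimensions.

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax."""
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def head(feats, W1, b1, W2, b2):
    """Two linear layers with a tanh in between, ending in log-softmax."""
    h = np.tanh(W1 @ feats + b1)
    return log_softmax(W2 @ h + b2)

rng = np.random.default_rng(0)
feats = rng.normal(size=6)                      # toy similarity features
out = head(feats, rng.normal(size=(4, 6)), np.zeros(4),
           rng.normal(size=(5, 4)), np.zeros(5))
print(np.exp(out).sum())  # the output exponentiates to a distribution
```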
16
-
Re-Implementation
◦ Model used in the paper was written in Torch
◦ Re-implement the model in PyTorch as part of wider efforts in the research group
◦ Make some changes to the network and compare performance
17
-
Datasets for experiments
◦ SICK
◦ Sentences Involving Compositional Knowledge
◦ 9,927 sentence pairs: 4,500 training, 500 dev, 4,927 testing
◦ Scores are in the range [1, 5]
◦ MSRVID
◦ Microsoft Video Paraphrase Corpus
◦ 1,500 sentence pairs: 750 training, 750 testing
◦ Since no dev set is provided, ~20% of the training data is held out for validation in each epoch
◦ Scores are in the range [0, 5]
18
-
Training
◦ Use 300 spatial filters and 20 per-dimension filters
◦ Both datasets are trained using Adam, with KL-divergence loss and an L2 regularization penalty of 0.001
◦ Use a batch size of 64 for SICK, 16 for MSRVID
◦ Learning rate: initially 0.1, but decreased by a factor of ~3 if validation performance does not improve after 2 epochs (reduce learning rate on plateau)
◦ Shuffle the training data after every epoch
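The reduce-on-plateau schedule above can be sketched as a small class; PyTorch provides this behaviour as `torch.optim.lr_scheduler.ReduceLROnPlateau`, and the patience semantics here are one plausible reading of "after 2 epochs".

```python
class ReduceOnPlateau:
    """Minimal sketch of the schedule described above: divide the
    learning rate by ~3 once validation loss has failed to improve
    for more than `patience` consecutive epochs."""
    def __init__(self, lr=0.1, factor=3.0, patience=2):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best, self.bad_epochs = float("inf"), 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr /= self.factor
                self.bad_epochs = 0
        return self.lr

sched = ReduceOnPlateau()
for loss in [1.0, 0.9, 0.9, 0.9, 0.9]:   # validation stops improving
    lr = sched.step(loss)
print(round(lr, 4))  # 0.0333
```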
19
-
Learning Curve
Training set loss for SICK dataset | Dev set loss for SICK dataset
*Note: the training set loss shown is the summed loss over batches, while the dev set loss shown is the average loss per batch. Due to an oversight, I did not have time before the presentation to make them consistent.
20
-
Evaluation metric curve
Pearson's r on dev set
21
-
Benchmark of Re-Implementation

SICK Dataset:
                                   r        ρ
2-layer Bidirectional LSTM         0.8488   0.7926
Tai et al. (2015) Const. LSTM      0.8491   0.7873
Tai et al. (2015) Dep. LSTM        0.8676   0.8083
Paper                              0.8686   0.8047
Re-impl.                           0.8553   0.7905

MSRVID Dataset:
                           r
Beltagy et al. (2014)      0.8300
Bär et al. (2012)          0.8730
Šarić et al. (2012)        0.8803
Paper                      0.9090
Re-impl.                   0.8668

r refers to Pearson's r; ρ refers to Spearman's ρ
22
-
Modification 1: Dropout

SICK Dataset:
                                   r        ρ
2-layer Bidirectional LSTM         0.8488   0.7926
Tai et al. (2015) Const. LSTM      0.8491   0.7873
Tai et al. (2015) Dep. LSTM        0.8676   0.8083
Paper                              0.8686   0.8047
Re-impl. w/ modif.                 0.8590   0.7917

MSRVID Dataset:
                           r
Beltagy et al. (2014)      0.8300
Bär et al. (2012)          0.8730
Šarić et al. (2012)        0.8803
Paper                      0.9090
Re-impl. w/ modif.         0.8788

Using dropout probability = 0.5
Gains over the plain re-implementation: +0.0037 (SICK r), +0.0012 (SICK ρ), +0.012 (MSRVID r)
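The dropout modification can be sketched as inverted dropout, which is how PyTorch's `nn.Dropout` behaves; during training each unit is zeroed with probability p and the survivors are rescaled, so no change is needed at test time.

```python
import numpy as np

def dropout(x, p=0.5, train=True, rng=None):
    """Inverted dropout: zero each unit with probability p during
    training and scale the rest by 1/(1-p); identity at test time."""
    if not train or p == 0.0:
        return x
    rng = rng or np.random.default_rng(0)
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

x = np.ones(8)
y = dropout(x, p=0.5)
print(y)  # surviving units are scaled up to 2.0, the rest are 0
```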
23
-
Modification 2: Batch Renormalization

MSRVID Dataset:
    r
    0.8604

SICK Dataset:
    r        ρ
    0.8016   0.7415

Unfortunately, batch normalization did not improve the performance with the default parameters
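For reference, the forward pass of plain (training-mode) batch normalization can be sketched as below; batch *re*normalization (Ioffe, 2017) additionally corrects the batch statistics toward running estimates, which is omitted here.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Vanilla batch normalization, training mode: normalize each
    feature over the batch, then scale by gamma and shift by beta."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.array([[1.0, 10.0], [3.0, 30.0]])
y = batch_norm(x)
print(y.mean(axis=0))  # each feature is normalized to ~zero mean
```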
24
-
Modification 3: Symmetric Compare Unit

SICK Dataset:
                                   r        ρ
2-layer Bidirectional LSTM         0.8488   0.7926
Tai et al. (2015) Const. LSTM      0.8491   0.7873
Tai et al. (2015) Dep. LSTM        0.8676   0.8083
Paper                              0.8686   0.8047
Re-impl. w/ modif.                 0.8565   0.7883

MSRVID Dataset:
                           r
Beltagy et al. (2014)      0.8300
Bär et al. (2012)          0.8730
Šarić et al. (2012)        0.8803
Paper                      0.9090
Re-impl. w/ modif.         0.8741

Compared with the dropout model as baseline, this did not improve performance: -0.0035 (SICK r), -0.0034 (SICK ρ), -0.0047 (MSRVID r)
25
-
Randomized Grid Search
The test and val metrics show Pearson's r. Found better performance for the MSRVID dataset (+0.001). As an improvement, can try picking from a random set of reasonable discrete parameters instead. Thanks to Salman Mohammed for the randomized hyperparameter search script.
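The core of a randomized hyperparameter search is just sampling configurations at random; the ranges and parameter names below are illustrative, not the ones used in the actual search script.

```python
import random

def sample_config(rng):
    """Draw one hyperparameter configuration at random: a log-uniform
    learning rate, a discrete batch size, and a dropout probability."""
    return {"lr": 10 ** rng.uniform(-4, -1),      # log-uniform LR
            "batch_size": rng.choice([16, 32, 64]),
            "dropout": rng.uniform(0.0, 0.5)}

rng = random.Random(0)
trials = [sample_config(rng) for _ in range(20)]
print(len(trials))  # 20 candidate configurations to train and compare
```

The suggested improvement on the slide corresponds to replacing the continuous `uniform` draws with `rng.choice` over curated discrete values.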
26
-
Work in Progress
◦ Adding an attention module in parallel with the convolution layers (Yin et al., 2016)
◦ Adding sparse features (e.g. idf) to the first linear layer
◦ Evaluate performance on other tasks
◦ TrecQA for question answering
◦ SNLI for inference (contradiction, entailment, neutral)
27
-
References
◦ Hua He, Kevin Gimpel, and Jimmy Lin. 2015. Multi-perspective sentence similarity modeling with convolutional neural networks. In Proceedings of EMNLP, pages 1576–1586.
◦ Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167.
◦ Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics.
◦ Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing.
◦ Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics.
28
-
References Cont'd
◦ Xiaodan Zhu, Parinaz Sobhani, and Hongyu Guo. 2015. Long short-term memory over recursive structures. In Proceedings of the 32nd International Conference on Machine Learning, pages 1604–1612.
◦ Daniel Bär, Chris Biemann, Iryna Gurevych, and Torsten Zesch. 2012. UKP: computing semantic textual similarity by combining multiple content similarity measures. In Proceedings of the First Joint Conference on Lexical and Computational Semantics, pages 435–440.
◦ Frane Šarić, Goran Glavaš, Mladen Karan, Jan Šnajder, and Bojana Dalbelo Bašić. 2012. TakeLab: systems for measuring semantic text similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics, pages 441–448.
◦ Islam Beltagy, Katrin Erk, and Raymond Mooney. 2014. Probabilistic soft logic for semantic textual similarity. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 1210–1219.
◦ Wenpeng Yin, Hinrich Schütze, Bing Xiang, and Bowen Zhou. 2016. ABCNN: attention-based convolutional neural network for modeling sentence pairs. In ACL.
29