
  • Modelling Sentence Pair Similarity with Multi-Perspective Convolutional Neural Networks

    ZHUCHENG TU | CS 898 SPRING 2017 | JULY 17, 2017

  • Outline

    ◦ Motivation
      ◦ Why do we want to model sentence similarity?
      ◦ Challenges
    ◦ Existing Work on Sentence Modeling
    ◦ Multi-Perspective CNN
    ◦ Modifications and Results
    ◦ Future Work

  • Motivation

    Modeling the similarity of a pair of sentences is critical to many NLP tasks:

    ◦ Paraphrase identification, e.g. plagiarism detection or detecting duplicate questions
    ◦ Question answering, e.g. answer selection
    ◦ Query ranking

  • What makes sentence modelling hard?

    ◦ Different ways of saying the same thing
    ◦ Little annotated training data
    ◦ "Difficult to use sparse, hand-crafted features as in conventional approaches in NLP" (He et al., 2015)

  • Existing Work

    ◦ Before deep learning, methods included:
      ◦ N-gram overlap on words and characters
      ◦ Knowledge-based approaches, e.g. using WordNet
      ◦ …
      ◦ Combinations of these methods and multi-task learning

    ◦ Deep learning methods:
      ◦ Collobert and Weston (2008) trained a CNN in a multitask setting
      ◦ Kalchbrenner et al. (2014) used dynamic k-max pooling to handle variable-sized input
      ◦ Kim (2014) used fixed & learned word vectors and varying window sizes & convolution filters
      ◦ more CNNs…
      ◦ Tai et al. (2015) and Zhu et al. (2015) used tree-based LSTMs

  • Multi-Perspective CNN

    ◦ Based on: Hua He, Kevin Gimpel, and Jimmy Lin. 2015. Multi-perspective sentence similarity modeling with convolutional neural networks. In Proceedings of EMNLP, pages 1576–1586.
    ◦ Compares sentence pairs using a "multiplicity of perspectives"
    ◦ Two components: sentence model and similarity measurement layer
    ◦ Advantages:
      ◦ Does not use syntax parsers
      ◦ Does not need an unsupervised pre-training step

  • Multi-Perspective CNN Architecture

    (Figure: overall architecture, highlighting the sentence model component)

  • Preparing Input

    ◦ Use GloVe (840B tokens, 2.2M vocab, 300d vectors) to create sentence embeddings
    ◦ Use values drawn from Normal(0, 1) for words not found in the vocab
    ◦ Pad sentence embeddings to create uniformly-sized batches for faster GPU training

    Example pair:
    "A group of kids is playing in a yard and an old man is standing in the background"
    "A group of boys in a yard is playing and a man is standing in the background"
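
    A minimal sketch of this input preparation in PyTorch (the framework of the re-implementation, see later slides). The `glove` dict and the helper names are illustrative assumptions, not the project's actual code:

```python
import numpy as np
import torch

EMB_DIM = 300

def embed_sentence(tokens, glove, rng=np.random):
    # Look up each token in GloVe; draw unseen words from Normal(0, 1).
    vecs = [glove[t] if t in glove else rng.normal(0.0, 1.0, EMB_DIM)
            for t in tokens]
    return torch.tensor(np.stack(vecs), dtype=torch.float)

def pad_batch(sentences):
    # Zero-pad every sentence to the longest length in the batch so the
    # batch forms one uniform tensor for faster GPU training.
    max_len = max(s.size(0) for s in sentences)
    batch = torch.zeros(len(sentences), max_len, EMB_DIM)
    for i, s in enumerate(sentences):
        batch[i, :s.size(0)] = s
    return batch
```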

  • Sentence Modelling: Multi-Perspective Convolution

    Two types of convolution for each sentence: holistic filters and per-dimensional filters.
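
    A sketch of the two filter types, assuming embeddings are laid out as channels of a 1D convolution; framing the per-dimensional filters as a grouped convolution is my reading of He et al. (2015), and the filter counts are the ones used later in training:

```python
import torch
import torch.nn as nn

EMB_DIM, N_HOLISTIC, N_PER_DIM, WS = 300, 300, 20, 3

# Holistic filters span the full embedding dimension in each window.
holistic = nn.Conv1d(EMB_DIM, N_HOLISTIC, kernel_size=WS)

# Per-dimensional filters convolve each embedding dimension on its own;
# groups=EMB_DIM gives N_PER_DIM independent filters per dimension.
per_dim = nn.Conv1d(EMB_DIM, EMB_DIM * N_PER_DIM, kernel_size=WS,
                    groups=EMB_DIM)

x = torch.randn(8, EMB_DIM, 24)   # (batch, emb_dim, seq_len)
h = torch.tanh(holistic(x))       # (8, 300, 22); tanh follows He et al.
p = torch.tanh(per_dim(x))        # (8, 6000, 22)
```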

  • Sentence Modeling: Multiple Pooling

    Multiple types of pooling for each type of convolution. We call the group of filters for a particular convolution type a "Block".
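
    A sketch of the pooling side of a block. He et al. (2015) use max, min, and mean pooling (mean for holistic filters only), so treat the exact set per filter type as an assumption here:

```python
def pool_group(conv_out):
    # conv_out: (batch, num_filters, seq_len). Apply each pooling type
    # over the sequence axis, giving one (batch, num_filters) vector
    # per pooling type.
    return {
        "max": conv_out.max(dim=2).values,
        "min": conv_out.min(dim=2).values,
        "mean": conv_out.mean(dim=2),
    }
```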

  • Sentence Modeling: Multiple Window Sizes

    Multiple blocks, each corresponding to a particular window width:

    ws = 1
    ws = 2
    ws = 3

    A special ws = ∞ corresponds with the entire sentence.
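
    Continuing the sketches above: one convolution per window width, plus the special ws = ∞ case, which I read as pooling the raw embedding matrix over the whole sentence (an assumption consistent with He et al., 2015):

```python
import torch
import torch.nn as nn

WINDOW_SIZES = [1, 2, 3]

# One holistic convolution per window size (per-dimension filters would
# be built the same way); each (conv, pooling) group is one "Block".
blocks = nn.ModuleDict({
    str(ws): nn.Conv1d(EMB_DIM, N_HOLISTIC, kernel_size=ws)
    for ws in WINDOW_SIZES
})

def encode(x):
    # x: (batch, EMB_DIM, seq_len) -> {window size: {pooling: features}}
    feats = {ws: pool_group(torch.tanh(conv(x)))
             for ws, conv in blocks.items()}
    feats["inf"] = pool_group(x)   # ws = ∞: pool the entire sentence
    return feats
```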

  • Sentence Modelling: Putting it together

    (Architecture figures, shown across two slides)

  • Similarity Measurement Layer

    ◦ We can flatten the outputs from the different blocks into a 1D vector and compare the result

    ◦ Problem: different parts of the flattened vector represent different results, so comparing flattened vectors might capture less information

    ◦ Instead, we can compare over non-flattened local regions

  • Local Region Comparisons

    Horizontal comparison: compare local regions of the two sentences by matching pooling method and window size, for holistic filters only. Compare using cosine distance and Euclidean distance. (A sketch follows below.)

    Vertical comparison: similar, but in the vertical direction, for both holistic and per-dimension filters. Compare using cosine distance, Euclidean distance, and element-wise absolute difference.
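
    A sketch of the horizontal comparison, assuming each sentence is encoded as the nested dict produced by `encode` above; the cosine/Euclidean pair follows the slide:

```python
import torch
import torch.nn.functional as F

def horizontal_compare(feats1, feats2):
    # Compare local regions that match in window size and pooling type,
    # using cosine similarity and Euclidean (L2) distance, collecting
    # the results into one feature vector per sentence pair.
    out = []
    for ws in feats1:
        for pool in feats1[ws]:
            a, b = feats1[ws][pool], feats2[ws][pool]  # (batch, n_filters)
            out.append(F.cosine_similarity(a, b, dim=1).unsqueeze(1))
            out.append((a - b).norm(p=2, dim=1, keepdim=True))
    return torch.cat(out, dim=1)
```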

  • Other Model Details

    ◦ Fully-connected layer: after similarity measurement, add two linear layers with a tanh activation layer in between

    ◦ Final layer is a log-softmax layer
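
    A sketch of this head; the feature and hidden sizes are illustrative assumptions, and the five output classes correspond to SICK's integer similarity levels:

```python
import torch.nn as nn

N_FEATURES, N_HIDDEN, N_CLASSES = 1024, 250, 5   # illustrative sizes

head = nn.Sequential(
    nn.Linear(N_FEATURES, N_HIDDEN),
    nn.Tanh(),                      # tanh between the two linear layers
    nn.Linear(N_HIDDEN, N_CLASSES),
    nn.LogSoftmax(dim=1),           # final log-softmax layer
)
```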

  • Re-Implementation

    ◦ The model used in the paper was written in Torch
    ◦ Re-implement the model in PyTorch as part of wider efforts in the research group
    ◦ Make some changes to the network and compare performance

  • Datasets for experiments

    ◦ SICK
      ◦ Sentences Involving Compositional Knowledge
      ◦ 9927 sentence pairs: 4500 training, 500 dev, 4927 testing
      ◦ Scores are in range [1, 5]

    ◦ MSRVID
      ◦ Microsoft Video Paraphrase Corpus
      ◦ 1500 sentence pairs: 750 training, 750 testing
      ◦ Since no dev set is provided, ~20% of the training data is held out for validation in each epoch
      ◦ Scores are in range [0, 5]

  • Training

    ◦ Use 300 holistic filters and 20 per-dimension filters
    ◦ Both datasets are trained using Adam, with KL-divergence loss and an L2 regularization penalty of 0.001
    ◦ Use a batch size of 64 for SICK, 16 for MSRVID
    ◦ Learning rate: initially 0.1, but decreased by a factor of ~3 if validation performance does not improve after 2 epochs (reduce learning rate on plateau)
    ◦ Shuffle training data after every epoch
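
    A sketch of this setup in PyTorch, reusing `head` from the classifier sketch above. Expressing the L2 penalty as Adam's weight decay and stepping the scheduler on the dev metric are assumptions about the implementation:

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import ReduceLROnPlateau

# KL divergence between the model's log-softmax output and gold scores
# converted to probability distributions (as in Tai et al., 2015).
criterion = nn.KLDivLoss(reduction="sum")

# L2 penalty of 0.001 expressed as weight decay (an assumption).
optimizer = optim.Adam(head.parameters(), lr=0.1, weight_decay=0.001)

# Cut the learning rate by a factor of ~3 when the dev metric has not
# improved for 2 epochs; call scheduler.step(dev_pearson) each epoch.
scheduler = ReduceLROnPlateau(optimizer, mode="max", factor=1/3,
                              patience=2)
```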

  • Learning Curve

    (Figures: training set loss and dev set loss for the SICK dataset)

    *Note: the training set loss plot shows summed loss over batches, while the dev set loss plot shows average loss per batch. Due to an oversight, I did not have time before the presentation to make them consistent.

  • Evaluation Metric Curve

    (Figure: Pearson's r on the dev set)

  • Benchmark of Re-Implementation

    SICK Dataset                      r        ρ
    2-layer Bidirectional LSTM        0.8488   0.7926
    Tai et al. (2015) Const. LSTM     0.8491   0.7873
    Tai et al. (2015) Dep. LSTM       0.8676   0.8083
    Paper                             0.8686   0.8047
    Re-impl.                          0.8553   0.7905

    MSRVID Dataset                    r
    Beltagy et al. (2014)             0.8300
    Bär et al. (2012)                 0.8730
    Šarić et al. (2012)               0.8803
    Paper                             0.9090
    Re-impl.                          0.8668

    r refers to Pearson's r; ρ refers to Spearman's ρ.

  • Modification 1: Dropout

    Using dropout probability = 0.5. Deltas are relative to the unmodified re-implementation.

    SICK Dataset                      r                  ρ
    2-layer Bidirectional LSTM        0.8488             0.7926
    Tai et al. (2015) Const. LSTM     0.8491             0.7873
    Tai et al. (2015) Dep. LSTM       0.8676             0.8083
    Paper                             0.8686             0.8047
    Re-impl. w/ modif.                0.8590 (+0.0037)   0.7917 (+0.0012)

    MSRVID Dataset                    r
    Beltagy et al. (2014)             0.8300
    Bär et al. (2012)                 0.8730
    Šarić et al. (2012)               0.8803
    Paper                             0.9090
    Re-impl. w/ modif.                0.8788 (+0.012)
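
    The slide does not say where dropout was inserted; a natural guess, shown as an assumption and reusing the sizes from the head sketch above, is around the fully-connected layers:

```python
import torch.nn as nn

head_with_dropout = nn.Sequential(
    nn.Dropout(p=0.5),              # dropout probability from the slide
    nn.Linear(N_FEATURES, N_HIDDEN),
    nn.Tanh(),
    nn.Dropout(p=0.5),              # placement is an assumption
    nn.Linear(N_HIDDEN, N_CLASSES),
    nn.LogSoftmax(dim=1),
)
```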

  • Modification 2: Batch Renormalization

    SICK Dataset         r        ρ
    Re-impl. w/ modif.   0.8016   0.7415

    MSRVID Dataset       r
    Re-impl. w/ modif.   0.8604

    Unfortunately, batch renormalization did not improve performance with the default parameters.

  • Modification 3: Symmetric Compare Unit

    Compared with adding dropout as the baseline, this did not improve performance. Deltas are relative to the dropout baseline.

    SICK Dataset                      r                  ρ
    2-layer Bidirectional LSTM        0.8488             0.7926
    Tai et al. (2015) Const. LSTM     0.8491             0.7873
    Tai et al. (2015) Dep. LSTM       0.8676             0.8083
    Paper                             0.8686             0.8047
    Re-impl. w/ modif.                0.8565 (-0.0035)   0.7883 (-0.0034)

    MSRVID Dataset                    r
    Beltagy et al. (2014)             0.8300
    Bär et al. (2012)                 0.8730
    Šarić et al. (2012)               0.8803
    Paper                             0.9090
    Re-impl. w/ modif.                0.8741 (-0.0047)

  • Randomized Grid Search

    (Table: sampled hyperparameter configurations; test and val metrics show Pearson's r)

    Found better performance for the MSRVID dataset (+0.001). As an improvement, can try picking from a random set of reasonable discrete parameters instead. Thanks to Salman Mohammed for the randomized hyperparameter search script.
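
    The actual script is credited to Salman Mohammed; the stand-in below only illustrates the idea, including the proposed improvement of sampling from reasonable discrete choices. `train_and_eval` is a hypothetical helper:

```python
import random

def sample_config():
    return {
        "lr": 10 ** random.uniform(-3, -1),          # continuous sample
        "batch_size": random.choice([16, 32, 64]),   # discrete choices,
        "reg": random.choice([1e-4, 1e-3, 1e-2]),    # per the follow-up idea
    }

# Try a handful of random configurations and keep the best dev Pearson's r;
# train_and_eval is a hypothetical stand-in for the training script.
results = [train_and_eval(sample_config()) for _ in range(20)]
best = max(results, key=lambda r: r["val_pearson"])
```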

  • Work in Progress

    ◦ Adding an attention module in parallel with the convolution layers (Yin et al., 2016)
    ◦ Adding sparse features (e.g. idf) to the first linear layer
    ◦ Evaluate performance on other tasks:
      ◦ TrecQA for question answering
      ◦ SNLI for inference (contradiction, entailment, neutral)

  • References

    ◦ Hua He, Kevin Gimpel, and Jimmy Lin. 2015. Multi-perspective sentence similarity modeling with convolutional neural networks. In Proceedings of EMNLP, pages 1576–1586.

    ◦ Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167.

    ◦ Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics.

    ◦ Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing.

    ◦ Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics.

  • References Cont'd

    ◦ Xiaodan Zhu, Parinaz Sobhani, and Hongyu Guo. 2015. Long short-term memory over recursive structures. In Proceedings of the 32nd International Conference on Machine Learning, pages 1604–1612.

    ◦ Daniel Bär, Chris Biemann, Iryna Gurevych, and Torsten Zesch. 2012. UKP: computing semantic textual similarity by combining multiple content similarity measures. In Proceedings of the First Joint Conference on Lexical and Computational Semantics, pages 435–440.

    ◦ Frane Šarić, Goran Glavaš, Mladen Karan, Jan Šnajder, and Bojana Dalbelo Bašić. 2012. TakeLab: systems for measuring semantic text similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics, pages 441–448.

    ◦ Islam Beltagy, Katrin Erk, and Raymond Mooney. 2014. Probabilistic soft logic for semantic textual similarity. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 1210–1219.

    ◦ Wenpeng Yin, Hinrich Schütze, Bing Xiang, and Bowen Zhou. 2016. ABCNN: attention-based convolutional neural network for modeling sentence pairs. In Transactions of the Association for Computational Linguistics (TACL).