Natural Language Processing with Deep Learning
CS224N/Ling284
Richard Socher
Lecture 2: Word Vectors
Organization
• PSet 1 is released. Coding session on 1/22 (Monday); PA1 due Thursday.
• Some of the questions from Piazza:
  • Sharing the choose-your-own final project with another class seems fine → Yes*
  • But how about the default final project? Can that also be used as a final project for a different course? → Yes*
  • Are we allowing students to bring one sheet of notes for the midterm? → Yes
  • Azure computing resources for Projects/PSet 4. Part of milestone.
Lecture Plan

1. Word meaning (15 mins)
2. Word2vec introduction (20 mins)
3. Word2vec objective function gradients (25 mins)
4. Optimization refresher (10 mins)
1. How do we represent the meaning of a word?

Definition: meaning (Webster dictionary)

• the idea that is represented by a word, phrase, etc.
• the idea that a person wants to express by using words, signs, etc.
• the idea that is expressed in a work of writing, art, etc.

Commonest linguistic way of thinking of meaning:

signifier (symbol) ⟺ signified (idea or thing) = denotation
How do we have usable meaning in a computer?

Common solution: Use e.g. WordNet, a resource containing lists of synonym sets and hypernyms ("is a" relationships).

e.g. synonym sets containing "good":

(adj) full, good
(adj) estimable, good, honorable, respectable
(adj) beneficial, good
(adj) good, just, upright
(adj) adept, expert, good, practiced, proficient, skillful
(adj) dear, good, near
(adj) good, right, ripe
…
(adv) well, good
(adv) thoroughly, soundly, good
(n) good, goodness
(n) commodity, trade good, good

e.g. hypernyms of "panda":

[Synset('procyonid.n.01'), Synset('carnivore.n.01'), Synset('placental.n.01'), Synset('mammal.n.01'), Synset('vertebrate.n.01'), Synset('chordate.n.01'), Synset('animal.n.01'), Synset('organism.n.01'), Synset('living_thing.n.01'), Synset('whole.n.02'), Synset('object.n.01'), Synset('physical_entity.n.01'), Synset('entity.n.01')]
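These outputs can be reproduced with NLTK's WordNet interface; a minimal sketch (assuming nltk is installed and the wordnet corpus has been downloaded):

```python
# Minimal sketch of the WordNet lookups above, via NLTK
# (assumes: pip install nltk, plus the wordnet corpus download below).
import nltk
nltk.download('wordnet', quiet=True)
from nltk.corpus import wordnet as wn

# Synonym sets containing "good", with their parts of speech
poses = {'n': 'noun', 'v': 'verb', 'a': 'adj', 's': 'adj (s)', 'r': 'adv'}
for synset in wn.synsets("good"):
    print(poses[synset.pos()] + ": " +
          ", ".join(lemma.name() for lemma in synset.lemmas()))

# Hypernyms ("is a" relationships) of "panda"
panda = wn.synset("panda.n.01")
print(list(panda.closure(lambda s: s.hypernyms())))
```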
Problems with resources like WordNet

• Great as a resource but missing nuance
  • e.g. "proficient" is listed as a synonym for "good". This is only correct in some contexts.
• Missing new meanings of words
  • e.g. wicked, badass, nifty, wizard, genius, ninja, bombest
  • Impossible to keep up-to-date!
• Subjective
• Requires human labor to create and adapt
• Hard to compute accurate word similarity
Representing words as discrete symbols

In traditional NLP, we regard words as discrete symbols: hotel, conference, motel

Words can be represented by one-hot vectors (one 1, the rest 0s):

motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]

Vector dimension = number of words in vocabulary (e.g. 500,000)
Problem with words as discrete symbols

Example: in web search, if a user searches for "Seattle motel", we would like to match documents containing "Seattle hotel".

But:

motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]

These two vectors are orthogonal. There is no natural notion of similarity for one-hot vectors!

Solution:
• Could rely on WordNet's list of synonyms to get similarity?
• Instead: learn to encode similarity in the vectors themselves
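A quick numerical illustration of the orthogonality problem (a toy sketch, not from the slides; the three-word vocabulary is an assumption):

```python
# One-hot vectors: the dot product of any two different words is 0,
# so one-hot encodings carry no similarity signal at all.
import numpy as np

vocab = ["seattle", "hotel", "motel"]          # toy vocabulary (assumption)

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

print(one_hot("hotel") @ one_hot("motel"))     # 0.0: "no similarity"
print(one_hot("hotel") @ one_hot("hotel"))     # 1.0: only identity matches
```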
Representing words by their context

• Core idea: A word's meaning is given by the words that frequently appear close-by
• "You shall know a word by the company it keeps" (J. R. Firth 1957: 11)
• One of the most successful ideas of modern statistical NLP!
• When a word w appears in a text, its context is the set of words that appear nearby (within a fixed-size window).
• Use the many contexts of w to build up a representation of w

…government debt problems turning into banking crises as happened in 2009…
…saying that Europe needs unified banking regulation to replace the hodgepodge…
…India has just given its banking system a shot in the arm…

These context words will represent banking.
Word vectors

We will build a dense vector for each word, chosen so that it is similar to vectors of words that appear in similar contexts.

Note: word vectors are sometimes called word embeddings or word representations.

linguistics = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349, 0.271]
2. Word2vec: Overview

Word2vec (Mikolov et al. 2013) is a framework for learning word vectors.

Idea:
• We have a large corpus of text
• Every word in a fixed vocabulary is represented by a vector
• Go through each position t in the text, which has a center word c and context ("outside") words o
• Use the similarity of the word vectors for c and o to calculate the probability of o given c (or vice versa)
• Keep adjusting the word vectors to maximize this probability
Word2Vec Overview

• Example windows and process for computing $P(w_{t+j} \mid w_t)$:

…problems turning into banking crises as…

Here "into" is the center word at position $t$; "problems", "turning" and "banking", "crises" are the outside context words in a window of size 2. We compute $P(w_{t-2} \mid w_t)$, $P(w_{t-1} \mid w_t)$, $P(w_{t+1} \mid w_t)$, and $P(w_{t+2} \mid w_t)$.
Word2Vec Overview

• The window then slides to the next position: "banking" becomes the center word at position $t$, with "turning", "into" and "crises", "as" as the outside context words in the window of size 2, and we again compute $P(w_{t-2} \mid w_t)$, $P(w_{t-1} \mid w_t)$, $P(w_{t+1} \mid w_t)$, and $P(w_{t+2} \mid w_t)$ for the new position.
Word2vec: objective function

For each position $t = 1, \dots, T$, predict context words within a window of fixed size $m$, given center word $w_t$.

Likelihood:
$$L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \neq 0}} P(w_{t+j} \mid w_t; \theta)$$

where $\theta$ is all variables to be optimized.

The objective function $J(\theta)$, sometimes called the cost or loss function, is the (average) negative log likelihood:
$$J(\theta) = -\frac{1}{T}\log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log P(w_{t+j} \mid w_t; \theta)$$

Minimizing the objective function ⟺ maximizing predictive accuracy.
Word2vec: objective function

• We want to minimize the objective function:
$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log P(w_{t+j} \mid w_t; \theta)$$
• Question: How to calculate $P(w_{t+j} \mid w_t; \theta)$?
• Answer: We will use two vectors per word $w$:
  • $v_w$ when $w$ is a center word
  • $u_w$ when $w$ is a context word
• Then for a center word $c$ and a context word $o$:
$$P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}$$
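As a concrete sketch, here is the naive-softmax $P(o \mid c)$ in numpy; the toy matrices U and V are assumptions standing in for the learned context and center vectors:

```python
# Naive-softmax P(o|c) = exp(u_o . v_c) / sum_w exp(u_w . v_c), sketched
# with random toy parameters (U: context vectors u_w, V: center vectors v_w).
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 5, 3                       # toy sizes (assumption)
U = rng.normal(size=(vocab_size, d))       # row w is u_w
V = rng.normal(size=(vocab_size, d))       # row w is v_w

def p_o_given_c(o, c):
    scores = U @ V[c]                      # u_w^T v_c for every word w
    scores -= scores.max()                 # stability trick; result unchanged
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]

print(p_o_given_c(o=2, c=0))               # a probability in (0, 1)
```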
Word2Vec Overview with Vectors

• Example windows and process for computing $P(w_{t+j} \mid w_t)$
• $P(u_{problems} \mid v_{into})$ is short for $P(\text{problems} \mid \text{into}; u_{problems}, v_{into}, \theta)$

…problems turning into banking crises as…

With "into" as the center word at position $t$, for the outside context words in the window of size 2 we compute $P(u_{problems} \mid v_{into})$, $P(u_{turning} \mid v_{into})$, $P(u_{banking} \mid v_{into})$, and $P(u_{crises} \mid v_{into})$.
Word2Vec Overview with Vectors

• Example windows and process for computing $P(w_{t+j} \mid w_t)$

…problems turning into banking crises as…

With "banking" as the center word at position $t$, for the outside context words in the window of size 2 we compute $P(u_{turning} \mid v_{banking})$, $P(u_{into} \mid v_{banking})$, $P(u_{crises} \mid v_{banking})$, and $P(u_{as} \mid v_{banking})$.
Word2vec: prediction function

$$P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}$$

The dot product compares the similarity of $o$ and $c$: a larger dot product means a larger probability. After taking the exponent, we normalize over the entire vocabulary.

• This is an example of the softmax function $\mathbb{R}^n \to \mathbb{R}^n$:
$$\text{softmax}(x_i) = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)} = p_i$$
• The softmax function maps arbitrary values $x_i$ to a probability distribution $p_i$
• "max" because it amplifies the probability of the largest $x_i$
• "soft" because it still assigns some probability to smaller $x_i$
• Frequently used in Deep Learning
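A minimal softmax sketch making the "soft"/"max" point concrete (the max-subtraction is a standard numerical-stability trick, not on the slide):

```python
# softmax: exponentiate, then normalize; subtracting the max first
# avoids overflow without changing the result.
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 4.0])))
# -> [0.042, 0.114, 0.844]: the largest input takes most of the mass
#    ("max"), but the smaller inputs still get some probability ("soft").
```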
To train the model: Compute all vector gradients!

• Recall: $\theta$ represents all model parameters, in one long vector
• In our case, with $d$-dimensional vectors and $V$-many words, $\theta \in \mathbb{R}^{2dV}$
• Remember: every word has two vectors
• We then optimize these parameters
3. Derivations of gradient

• Whiteboard (see the video if you're not in class ;)
• The basic Lego piece
• Useful basics:
• If in doubt: write it out with indices
• Chain rule! If $y = f(u)$ and $u = g(x)$, i.e. $y = f(g(x))$, then:
$$\frac{dy}{dx} = \frac{dy}{du}\,\frac{du}{dx}$$
Chain Rule

• Chain rule! If $y = f(u)$ and $u = g(x)$, i.e. $y = f(g(x))$, then:
$$\frac{dy}{dx} = \frac{dy}{du}\,\frac{du}{dx}$$
• Simple example:
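The slide's own worked example is on the board; as an illustrative stand-in, take $y = (x^3 + 7)^4$ with $u = x^3 + 7$:

$$\frac{dy}{dx} = \frac{dy}{du}\,\frac{du}{dx} = 4u^3 \cdot 3x^2 = 12x^2\,(x^3 + 7)^3$$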
Interactive Whiteboard Session!

Let's derive the gradient for the center word together. For one example window and one example outside word:

You then also need the gradient for context words (it's similar; left for homework). That's all of the parameters $\theta$ here.
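For reference, a sketch of where the whiteboard derivation lands for one $(c, o)$ pair under the naive-softmax model defined earlier (a standard result):

$$\frac{\partial}{\partial v_c} \log P(o \mid c)
= \frac{\partial}{\partial v_c}\left( u_o^\top v_c - \log \sum_{w \in V} \exp(u_w^\top v_c) \right)
= u_o - \sum_{w \in V} P(w \mid c)\, u_w$$

That is, the observed context vector minus the model's expected context vector under $P(\cdot \mid c)$.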
Calculating all gradients!

• We went through the gradient for each center vector $v$ in a window
• We also need gradients for the outside vectors $u$
• Derive at home!
• Generally, in each window we will compute updates for all parameters that are being used in that window. For example:

…problems turning into banking crises as…

With "banking" as the center word at position $t$ and a window of size 2, the window uses $v_{banking}$ together with the outside vectors in $P(u_{turning} \mid v_{banking})$, $P(u_{into} \mid v_{banking})$, $P(u_{crises} \mid v_{banking})$, and $P(u_{as} \mid v_{banking})$, so all of these parameters get updated.
Word2vec: More details

Why two vectors? → Easier optimization. Average both at the end.

Two model variants:
1. Skip-grams (SG): predict context ("outside") words (position independent) given the center word
2. Continuous Bag of Words (CBOW): predict the center word from a (bag of) context words

This lecture so far: the Skip-gram model.

Additional efficiency in training:
1. Negative sampling

So far: focus on naïve softmax (the simpler training method).
Gradient Descent

• We have a cost function $J(\theta)$ we want to minimize
• Gradient Descent is an algorithm to minimize $J(\theta)$
• Idea: for the current value of $\theta$, calculate the gradient of $J(\theta)$, then take a small step in the direction of the negative gradient. Repeat.

Note: Our objectives are not convex like this :(
Intuition

For a simple convex function over two parameters; contour lines show levels of the objective function.
Gradient Descent

• Update equation (in matrix notation):
$$\theta^{new} = \theta^{old} - \alpha \nabla_\theta J(\theta)$$
• Update equation (for a single parameter):
$$\theta_j^{new} = \theta_j^{old} - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$
• $\alpha$ = step size or learning rate
• Algorithm: see the sketch below
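The algorithm box can be sketched as a loop; here on a self-contained toy objective (the quadratic $J(\theta) = \lVert\theta\rVert^2$ is an assumption standing in for the real loss):

```python
# Batch gradient descent on a toy convex objective J(theta) = ||theta||^2.
import numpy as np

def J_grad(theta):
    return 2 * theta                        # gradient of ||theta||^2

theta = np.array([3.0, -2.0])               # initial parameters (arbitrary)
alpha = 0.1                                 # step size / learning rate
for _ in range(100):
    theta = theta - alpha * J_grad(theta)   # step against the gradient
print(theta)                                # ~[0, 0], the minimum
```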
Stochastic Gradient Descent

• Problem: $J(\theta)$ is a function of all windows in the corpus (potentially billions!)
  • So $\nabla_\theta J(\theta)$ is very expensive to compute
  • You would wait a very long time before making a single update!
• A very bad idea for pretty much all neural nets!
• Solution: Stochastic gradient descent (SGD)
  • Repeatedly sample windows, and update after each one.
• Algorithm: see the sketch below
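A matching SGD sketch on a toy problem (the per-sample loss $(\theta - x)^2$ and the list of "windows" are assumptions; the point is the sample-then-update loop):

```python
# SGD: sample one "window" at a time and update immediately, instead of
# computing the gradient over the whole corpus before each step.
import random

data = [1.0, 2.0, 3.0, 4.0]          # toy corpus of windows (assumption)
theta, alpha = 0.0, 0.05
for _ in range(2000):
    x = random.choice(data)          # repeatedly sample windows
    grad = 2 * (theta - x)           # gradient of (theta - x)^2 on the sample
    theta -= alpha * grad            # update after each one
print(theta)                         # hovers near the data mean, 2.5
```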
PSet 1: The skip-gram model and negative sampling

• From paper: "Distributed Representations of Words and Phrases and their Compositionality" (Mikolov et al. 2013)
• Overall objective function (they maximize):
$$J(\theta) = \frac{1}{T}\sum_{t=1}^{T} J_t(\theta), \qquad J_t(\theta) = \log \sigma(u_o^\top v_c) + \sum_{i=1}^{k} \mathbb{E}_{j \sim P(w)}\!\left[\log \sigma(-u_j^\top v_c)\right]$$
• The sigmoid function (we'll become good friends soon): $\sigma(x) = \dfrac{1}{1 + e^{-x}}$
• So we maximize the probability of the two words co-occurring in the first log term.
PSet 1: The skip-gram model and negative sampling

• Simpler notation, more similar to class and PSet:
$$J_{\text{neg-sample}}(o, v_c, U) = -\log \sigma(u_o^\top v_c) - \sum_{i=1}^{k} \log \sigma(-u_{w_i}^\top v_c)$$
• We take k negative samples (the $w_i$ above).
• Maximize the probability that the real outside word appears; minimize the probability that random words appear around the center word.
• $P(w) = U(w)^{3/4}/Z$: the unigram distribution $U(w)$ raised to the 3/4 power (we provide this function in the starter code).
• The power makes less frequent words be sampled more often.
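A sketch of the sampling distribution (the toy count table is an assumption; in the PSet the function is provided in the starter code):

```python
# P(w) = U(w)^{3/4} / Z: unigram distribution raised to the 3/4 power.
import numpy as np

counts = np.array([100.0, 10.0, 1.0])        # toy word frequencies
U = counts / counts.sum()                    # unigram distribution U(w)
P = U ** 0.75
P /= P.sum()                                 # normalize by Z
print(U)   # [0.901 0.090 0.009]
print(P)   # approx [0.827 0.147 0.026]: rare words get more mass than under U
negatives = np.random.default_rng(0).choice(len(P), size=5, p=P)  # k samples
```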
PSet 1: The continuous bag of words model

• Main idea of continuous bag of words (CBOW): predict the center word from the sum of the surrounding word vectors, instead of predicting surrounding single words from the center word as in the skip-gram model.
• To make the assignment slightly easier: implementation of the CBOW model is not required (you can do it for a couple of bonus points!), but you do have to do the theory problem on CBOW.