
Page 1:

CS388: Natural Language Processing Lecture 6: Neural Networks

Greg Durrett

Page 2:

Administrivia

‣ Mini 1 graded, posted on Canvas

‣ Project 1 due in 9 days

‣ Xi Ye (88.0 F1), Quang Duong (87.3 F1), Uday Kusupati (87.2 F1); 6 students in the 86 range, rest are 85 or below

‣ Test F1s << dev F1

‣ Changing thresholds / imbalanced classification

‣ POS/chunk features

‣ Small bug fixed in BadNerModel (no impact on the code you write)

‣ Someone got 86.3 with only 7 features total; the classifier is a dictionary

Page 3:

This Lecture

‣ Feedforward neural networks + backpropagation

‣ Neural network basics

‣ Applications

‣ Neural network history

‣ Beam search: in a few lectures

‣ Implementing neural networks (if time)

Page 4:

History: NN "dark ages"

‣ Convnets: applied to MNIST by LeCun in 1998

‣ LSTMs: Hochreiter and Schmidhuber (1997)

‣ Henderson (2003): neural shift-reduce parser, not SOTA

Page 5:

2008-2013: A glimmer of light…

‣ Collobert and Weston 2011: "NLP (almost) from scratch"

‣ Feedforward neural nets induce features for sequential CRFs ("neural CRF")

‣ 2008 version was marred by bad experiments, claimed SOTA but wasn't; 2011 version tied SOTA

‣ Socher 2011-2014: tree-structured RNNs working okay

‣ Krizhevsky et al. (2012): AlexNet for vision

Page 6:

2014: Stuff starts working

‣ Sutskever et al. + Bahdanau et al.: seq2seq for neural MT (LSTMs work for NLP?)

‣ Kim (2014) + Kalchbrenner et al. (2014): sentence classification / sentiment (convnets work for NLP?)

‣ Chen and Manning transition-based dependency parser (even feedforward networks work well for NLP?)

‣ 2015: explosion of neural nets for everything under the sun

Page 7:

Why didn't they work before?

‣ Datasets too small: for MT, not really better until you have 1M+ parallel sentences (and really need a lot more)

‣ Optimization not well understood: good initialization, per-feature scaling + momentum (Adagrad/Adadelta/Adam) work best out of the box

‣ Regularization: dropout is pretty helpful

‣ Inputs: need word representations to have the right continuous semantics

‣ Computers not big enough: can't run for enough iterations

Page 8:

Neural Net Basics

Page 9:

Neural Networks

‣ How can we do nonlinear classification? Kernels are too slow…

‣ Want to learn intermediate conjunctive features of the input

‣ Linear classification: $\arg\max_y w^\top f(x, y)$

"the movie was not all that good"

I[contains not & contains good]

Page 10:

Neural Networks: XOR

‣ Let's see how we can use neural nets to learn a simple nonlinear function

‣ Inputs: $x_1, x_2$ (generally $x = (x_1, \ldots, x_m)$)

‣ Output: $y = x_1 \text{ XOR } x_2$ (generally $y = (y_1, \ldots, y_n)$)

x1  x2 | y = x1 XOR x2
0   0  | 0
0   1  | 1
1   0  | 1
1   1  | 0

Page 11:

Neural Networks: XOR

[Same XOR truth table, with the four points plotted in the (x1, x2) plane]

$y = a_1 x_1 + a_2 x_2$ (a purely linear function cannot represent XOR)

$y = a_1 x_1 + a_2 x_2 + a_3 \tanh(x_1 + x_2)$ (the added term acts like an "or" of the inputs)

(looks like action potential in neuron)

Page 12:

Neural Networks: XOR

[Same XOR truth table and plot in the (x1, x2) plane; the linear model $y = a_1 x_1 + a_2 x_2$ is crossed out]

$y = a_1 x_1 + a_2 x_2 + a_3 \tanh(x_1 + x_2)$, with $\tanh(x_1 + x_2)$ acting like an "or"

For example: $y = -x_1 - x_2 + 2 \tanh(x_1 + x_2)$
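A quick numeric check of that last function (a sketch, not from the slides; the 0.25 threshold is just one choice that works):

import math

def f(x1, x2):
    # y = -x1 - x2 + 2*tanh(x1 + x2)
    return -x1 - x2 + 2 * math.tanh(x1 + x2)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(f(x1, x2), 2))

# (0,0) -> 0.0, (0,1) and (1,0) -> about 0.52, (1,1) -> about -0.07,
# so thresholding y at 0.25 reproduces XOR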

Page 13:

Neural Networks: XOR

[Plot in the (x1, x2) plane with the four points labeled 0, 1, -1, 0]

$x_1 = $ I[contains not]
$x_2 = $ I[contains good]

$y = -2 x_1 - x_2 + 2 \tanh(x_1 + x_2)$

"the movie was not all that good"

Page 14:

Neural Networks

Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

Linear model: $y = w \cdot x + b$

$y = g(w \cdot x + b)$
$y = g(Wx + b)$

Warp space (multiply by W), shift (add b), nonlinear transformation (apply g)

Page 15:

Neural Networks

Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

[Figure: a linear classifier vs. a neural network on the same data]

… possible because we transformed the space!

Page 16:

Deep Neural Networks

Adopted from Chris Dyer

$y = g(Wx + b)$  (output of first layer)
$z = g(Vy + c)$
$z = g(V g(Wx + b) + c)$  (input → first layer → second layer)

‣ "Feedforward" computation (not recurrent)

‣ Check: what happens if there is no nonlinearity, i.e. $z = V(Wx + b) + c$? Is this more powerful than basic linear models? (see the note below)
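One way to answer that check (this step is not spelled out on the slide): without the nonlinearity $g$, the two layers collapse into a single affine map, so nothing is gained over a linear model:

$z = V(Wx + b) + c = (VW)x + (Vb + c) = \tilde{W}x + \tilde{b}$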

Page 17:

Deep Neural Networks

Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

Page 18:

Feedforward Networks, Backpropagation

Page 19:

Logistic Regression with NNs

‣ Single scalar probability: $P(y|x) = \frac{\exp(w^\top f(x, y))}{\sum_{y'} \exp(w^\top f(x, y'))}$

‣ Compute scores for all possible labels at once (returns vector): $P(y|x) = \mathrm{softmax}\left([w^\top f(x, y)]_{y \in \mathcal{Y}}\right)$

‣ softmax: exps and normalizes a given vector: $\mathrm{softmax}(p)_i = \frac{\exp(p_i)}{\sum_{i'} \exp(p_{i'})}$

‣ $P(y|x) = \mathrm{softmax}(W f(x))$: weight vector per class; W is [num classes x num feats]

‣ $P(y|x) = \mathrm{softmax}(W g(V f(x)))$: now one hidden layer
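A minimal sketch of this one-hidden-layer model in NumPy (the sizes and random parameters are illustrative, not from the lecture):

import numpy as np

def softmax(p):
    # exponentiate and normalize a vector (shift by max for numerical stability)
    e = np.exp(p - p.max())
    return e / e.sum()

n, d, num_classes = 5, 3, 2           # toy sizes
f_x = np.random.randn(n)              # feature vector f(x)
V = np.random.randn(d, n)             # hidden-layer weights, d x n
W = np.random.randn(num_classes, d)   # output weights, num_classes x d

z = np.tanh(V @ f_x)                  # hidden layer g(V f(x))
probs = softmax(W @ z)                # P(y|x) = softmax(W g(V f(x)))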

Page 20:

Neural Networks for Classification

$P(y|x) = \mathrm{softmax}(W g(V f(x)))$

[Pipeline diagram:] f(x) (n features) → V (d x n matrix) → g (nonlinearity: tanh, relu, …) → z (d hidden units) → W (num_classes x d matrix) → softmax → P(y|x) (num_classes probs)

Page 21:

Training Neural Networks

‣ Maximize log likelihood of training data

‣ $i^*$: index of the gold label

‣ $e_i$: 1 in the ith row, zero elsewhere. Dot by this = select ith index

$z = g(V f(x))$,   $P(y|x) = \mathrm{softmax}(Wz)$

$\mathcal{L}(x, i^*) = \log P(y = i^* \mid x) = \log\left(\mathrm{softmax}(Wz) \cdot e_{i^*}\right)$

$\mathcal{L}(x, i^*) = Wz \cdot e_{i^*} - \log \sum_j \exp(Wz) \cdot e_j$

Page 22:

Computing Gradients

$\mathcal{L}(x, i^*) = Wz \cdot e_{i^*} - \log \sum_j \exp(Wz) \cdot e_j$

‣ Gradient with respect to W:

$\frac{\partial}{\partial W_{ij}} \mathcal{L}(x, i^*) = \begin{cases} z_j - P(y = i \mid x)\, z_j & \text{if } i = i^* \\ -P(y = i \mid x)\, z_j & \text{otherwise} \end{cases}$

‣ Looks like logistic regression with z as the features!
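A short derivation of that gradient, consistent with the loss above (this step is not written out on the slide): since $(Wz)_k = \sum_j W_{kj} z_j$,

$\frac{\partial}{\partial W_{ij}} \mathcal{L}(x, i^*) = \frac{\partial}{\partial W_{ij}} \left[ (Wz)_{i^*} - \log \sum_k \exp\left((Wz)_k\right) \right] = z_j \mathbb{1}[i = i^*] - \frac{\exp\left((Wz)_i\right)}{\sum_k \exp\left((Wz)_k\right)}\, z_j = z_j \mathbb{1}[i = i^*] - P(y = i \mid x)\, z_j$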

Page 23:

Neural Networks for Classification

[Same pipeline diagram: f(x) → V → g → z → W → softmax → P(y|x), annotated with $\partial\mathcal{L}/\partial W$ at the output weights and $z$ at the hidden layer]

$P(y|x) = \mathrm{softmax}(W g(V f(x)))$

Page 24:

Computing Gradients: Backpropagation

$z = g(V f(x))$  (activations at hidden layer)

$\mathcal{L}(x, i^*) = Wz \cdot e_{i^*} - \log \sum_j \exp(Wz) \cdot e_j$

‣ Gradient with respect to V: apply the chain rule

$\frac{\partial \mathcal{L}(x, i^*)}{\partial V_{ij}} = \frac{\partial \mathcal{L}(x, i^*)}{\partial z} \frac{\partial z}{\partial V_{ij}}$

[some math…]

$\mathrm{err(root)} = e_{i^*} - P(y|x)$  (dim = m)

$\frac{\partial \mathcal{L}(x, i^*)}{\partial z} = \mathrm{err}(z) = W^\top \mathrm{err(root)}$  (dim = d)
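One way to spell out the "[some math…]" step, consistent with the gradient for W two slides back (this derivation is added here, not taken from the slide):

$\frac{\partial \mathcal{L}(x, i^*)}{\partial (Wz)} = e_{i^*} - \mathrm{softmax}(Wz) = \mathrm{err(root)}, \qquad \frac{\partial \mathcal{L}(x, i^*)}{\partial z} = W^\top \mathrm{err(root)} = \mathrm{err}(z)$

$\frac{\partial z_i}{\partial V_{ij}} = g'\left((V f(x))_i\right) f(x)_j \quad\Rightarrow\quad \frac{\partial \mathcal{L}(x, i^*)}{\partial V_{ij}} = \mathrm{err}(z)_i \, g'\left((V f(x))_i\right) f(x)_j$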

Page 25:

Backpropagation: Picture

[Same pipeline diagram: f(x) → V → g → z → W → softmax → P(y|x), annotated with $\partial\mathcal{L}/\partial W$, err(root), and err(z)]

$P(y|x) = \mathrm{softmax}(W g(V f(x)))$

‣ Can forget everything after z, treat it as the output, and keep backpropping

Page 26:

Backpropagation: Takeaways

‣ Gradients of output weights W are easy to compute: looks like logistic regression with hidden layer z as the feature vector

‣ Can compute the derivative of the loss with respect to z to form an "error signal" for backpropagation

‣ Easy to update parameters based on the "error signal" from the next layer; keep pushing the error signal back as backpropagation

‣ Need to remember the values from the forward computation

Page 27:

Applications

Page 28:

NLP with Feedforward Networks

Botha et al. (2017)

‣ Part-of-speech tagging with FFNNs

"Fed raises interest rates in order to…"

f(x) = [emb(raises), emb(interest), emb(rates), other words, feats, etc.]
       (previous word, curr word, next word)

‣ Word embeddings for each word form input

‣ ~1000 features here: smaller feature vector than in sparse models, but every feature fires on every example

‣ Weight matrix learns position-dependent processing of the words (a small sketch follows below)
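A minimal sketch of this setup in PyTorch (the embedding dimension, window of three words, and tag count are illustrative choices, not values from Botha et al.):

import torch
import torch.nn as nn

vocab_size, emb_dim, hid, num_tags = 10000, 50, 100, 45   # illustrative sizes

emb = nn.Embedding(vocab_size, emb_dim)

ffnn = nn.Sequential(
    nn.Linear(3 * emb_dim, hid),   # separate weights for each slot: prev / curr / next
    nn.Tanh(),
    nn.Linear(hid, num_tags),
    nn.Softmax(dim=-1),
)

word_ids = torch.tensor([3, 17, 42])   # indices of previous, current, next word
f_x = emb(word_ids).view(-1)           # concatenate the three embeddings into one feature vector
probs = ffnn(f_x)                      # distribution over POS tags for the current word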

Page 29:

NLP with Feedforward Networks

‣ Hidden layer mixes these different signals and learns feature conjunctions

Botha et al. (2017)

Page 30:

NLP with Feedforward Networks

‣ Multilingual tagging results:

Botha et al. (2017)

‣ Gillick used LSTMs; this is smaller, faster, and better

Page 31:

Sentiment Analysis

‣ Deep Averaging Networks: feedforward neural network on the average of word embeddings from the input (a small sketch follows below)

Iyyer et al. (2015)
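A minimal sketch of the averaging idea (layer sizes and depth are illustrative, not taken from Iyyer et al.):

import torch
import torch.nn as nn

vocab_size, emb_dim, hid, num_classes = 10000, 50, 100, 2   # illustrative sizes

emb = nn.Embedding(vocab_size, emb_dim)
dan = nn.Sequential(
    nn.Linear(emb_dim, hid), nn.Tanh(),
    nn.Linear(hid, num_classes), nn.Softmax(dim=-1),
)

sentence = torch.tensor([4, 8, 15, 16, 23])   # word indices for one input sentence
avg = emb(sentence).mean(dim=0)               # average of the word embeddings
probs = dan(avg)                              # sentiment distribution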

Page 32:

Sentiment Analysis

[Results table comparing bag-of-words models with Tree RNN / CNN / LSTM models]

Wang and Manning (2012); Kim (2014); Iyyer et al. (2015)

Page 33:

Coreference Resolution

‣ Feedforward networks identify coreference arcs

Clark and Manning (2015), Wiseman et al. (2015)

"President Obama signed…"    "He later gave a speech…"
(are these two mentions coreferent?)

Page 34:

Implementation Details

Page 35:

Computation Graphs

‣ Computing gradients is hard!

‣ Automatic differentiation: instrument code to keep track of derivatives

y = x * x     --codegen-->     (y, dy) = (x * x, 2 * x * dx)

‣ Computation is now something we need to reason about symbolically

‣ Use a library like Pytorch or Tensorflow. This class: Pytorch
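What this looks like with PyTorch's autograd (a toy check of the x * x example above):

import torch

x = torch.tensor(3.0, requires_grad=True)
y = x * x        # forward pass records the computation graph
y.backward()     # backward pass computes dy/dx
print(x.grad)    # tensor(6.) = 2 * x, matching the hand-written derivative above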

Page 36:

Computation Graphs in Pytorch

$P(y|x) = \mathrm{softmax}(W g(V f(x)))$

import torch
import torch.nn as nn

class FFNN(nn.Module):
    def __init__(self, inp, hid, out):
        super(FFNN, self).__init__()
        self.V = nn.Linear(inp, hid)
        self.g = nn.Tanh()
        self.W = nn.Linear(hid, out)
        self.softmax = nn.Softmax(dim=0)

    def forward(self, x):
        return self.softmax(self.W(self.g(self.V(x))))

‣ Define forward pass for the network; gradients are computed automatically

Page 37:

Computation Graphs in Pytorch

$P(y|x) = \mathrm{softmax}(W g(V f(x)))$

ffnn = FFNN(inp, hid, out)
optimizer = torch.optim.Adam(ffnn.parameters())   # example: any optimizer over the network's parameters

def make_update(input, gold_label):
    # gold_label is ei*: one-hot vector of the label (e.g., [0, 1, 0])
    ffnn.zero_grad()   # clear gradient variables
    probs = ffnn.forward(input)
    loss = torch.neg(torch.log(probs)).dot(gold_label)
    loss.backward()
    optimizer.step()

Page 38:

Training a Model

Define a computation graph

For each epoch:

    For each batch of data:

        Compute loss on batch

        Autograd to compute gradients and take step

Decode test set
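A minimal sketch of this recipe in PyTorch, reusing the FFNN module and update step from the previous slides (num_feats, hid, num_classes, num_epochs, training_batches, and test_inputs are placeholder names, not from the lecture):

ffnn = FFNN(num_feats, hid, num_classes)             # define the computation graph
optimizer = torch.optim.Adam(ffnn.parameters())

for epoch in range(num_epochs):
    for input, gold_label in training_batches:       # each batch of data
        ffnn.zero_grad()
        probs = ffnn.forward(input)
        loss = torch.neg(torch.log(probs)).dot(gold_label)   # loss on batch
        loss.backward()                               # autograd computes gradients
        optimizer.step()                              # take a step

predictions = [torch.argmax(ffnn.forward(x)) for x in test_inputs]   # decode test set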

Page 39:

Batching

‣ Batching data gives speedups due to more efficient matrix operations

‣ Need to make the computation graph process a batch at the same time

def make_update(input, gold_label):
    # input is [batch_size, num_feats]
    # gold_label is [batch_size, num_classes]
    ...
    probs = ffnn.forward(input)   # [batch_size, num_classes]
    loss = torch.sum(torch.neg(torch.log(probs)) * gold_label)   # elementwise product with one-hot labels, summed over the batch
    ...

‣ Batch sizes from 1-100 often work well

Page 40:

Next Time

‣ More implementation details: practical training techniques

‣ Word representations / word vectors

‣ word2vec, GloVe