
Page 1:

CS388: Natural Language Processing Lecture 6: Neural Networks

Greg Durrett

Page 2:

Administrivia

‣ Mini 1 graded, posted on Canvas

‣ Project 1 due in 9 days

‣ Xi Ye (88.0 F1), Quang Duong (87.3 F1), Uday Kusupati (87.2 F1); 6 students in the 86 range, rest are 85 or below

‣ Test F1s << dev F1

‣ Changing thresholds / imbalanced classification

‣ POS/chunk features

‣ Small bug fixed in BadNerModel (no impact on the code you write)

‣ Someone got 86.3 with only 7 features total; the classifier is a dictionary

Page 3:

This Lecture

‣ Feedforward neural networks + backpropagation

‣ Neural network basics

‣ Applications

‣ Neural network history

‣ Beam search: in a few lectures

‣ Implementing neural networks (if time)

Page 4:

History: NN "dark ages"

‣ Convnets: applied to MNIST by LeCun in 1998

‣ LSTMs: Hochreiter and Schmidhuber (1997)

‣ Henderson (2003): neural shift-reduce parser, not SOTA

Page 5:

2008-2013: A glimmer of light…

‣ Collobert and Weston 2011: "NLP (almost) from scratch"

‣ Feedforward neural nets induce features for sequential CRFs ("neural CRF")

‣ 2008 version was marred by bad experiments, claimed SOTA but wasn't; 2011 version tied SOTA

‣ Socher 2011-2014: tree-structured RNNs working okay

‣ Krizhevsky et al. (2012): AlexNet for vision

Page 6:

2014: Stuff starts working

‣ Sutskever et al. + Bahdanau et al.: seq2seq for neural MT (LSTMs work for NLP?)

‣ Kim (2014) + Kalchbrenner et al. (2014): sentence classification / sentiment (convnets work for NLP?)

‣ Chen and Manning transition-based dependency parser (even feedforward networks work well for NLP?)

‣ 2015: explosion of neural nets for everything under the sun

Page 7:

Why didn't they work before?

‣ Datasets too small: for MT, not really better until you have 1M+ parallel sentences (and really need a lot more)

‣ Optimization not well understood: good initialization, per-feature scaling + momentum (Adagrad/Adadelta/Adam) work best out of the box

‣ Regularization: dropout is pretty helpful

‣ Inputs: need word representations to have the right continuous semantics

‣ Computers not big enough: can't run for enough iterations

Page 8:

Neural Net Basics

Page 9:

Neural Networks

‣ How can we do nonlinear classification? Kernels are too slow…

‣ Want to learn intermediate conjunctive features of the input

‣ Linear classification: $\arg\max_y w^\top f(x, y)$

"the movie was not all that good"

I[contains not & contains good]

Page 10:

Neural Networks: XOR

‣ Let's see how we can use neural nets to learn a simple nonlinear function

‣ Inputs: $x_1, x_2$ (generally $x = (x_1, \ldots, x_m)$)

‣ Output: $y = x_1 \text{ XOR } x_2$ (generally $y = (y_1, \ldots, y_n)$)

x1  x2 | y = x1 XOR x2
0   0  | 0
0   1  | 1
1   0  | 1
1   1  | 0

Page 11:

Neural Networks: XOR

[Same XOR truth table, with the four points plotted in the (x1, x2) plane]

$y = a_1 x_1 + a_2 x_2$ (a purely linear function cannot represent XOR)

$y = a_1 x_1 + a_2 x_2 + a_3 \tanh(x_1 + x_2)$ (the added term acts like an "or" of the inputs)

(looks like action potential in neuron)

Page 12:

Neural Networks: XOR

[Same XOR truth table and plot in the (x1, x2) plane; the linear model $y = a_1 x_1 + a_2 x_2$ is crossed out]

$y = a_1 x_1 + a_2 x_2 + a_3 \tanh(x_1 + x_2)$, with $\tanh(x_1 + x_2)$ acting like an "or"

For example: $y = -x_1 - x_2 + 2 \tanh(x_1 + x_2)$
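A quick numeric check of that last function (a sketch, not from the slides; the 0.25 threshold is just one choice that works):

import math

def f(x1, x2):
    # y = -x1 - x2 + 2*tanh(x1 + x2)
    return -x1 - x2 + 2 * math.tanh(x1 + x2)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(f(x1, x2), 2))

# (0,0) -> 0.0, (0,1) and (1,0) -> about 0.52, (1,1) -> about -0.07,
# so thresholding y at 0.25 reproduces XOR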

Page 13:

Neural Networks: XOR

[Plot in the (x1, x2) plane with the four points labeled 0, 1, -1, 0]

$x_1 = $ I[contains not]
$x_2 = $ I[contains good]

$y = -2 x_1 - x_2 + 2 \tanh(x_1 + x_2)$

"the movie was not all that good"

Page 14:

Neural Networks

Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

Linear model: $y = w \cdot x + b$

$y = g(w \cdot x + b)$
$y = g(Wx + b)$

Warp space (multiply by W), shift (add b), nonlinear transformation (apply g)

Page 15:

Neural Networks

Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

[Figure: a linear classifier vs. a neural network on the same data]

… possible because we transformed the space!

Page 16:

Deep Neural Networks

Adopted from Chris Dyer

$y = g(Wx + b)$  (output of first layer)
$z = g(Vy + c)$
$z = g(V g(Wx + b) + c)$  (input → first layer → second layer)

‣ "Feedforward" computation (not recurrent)

‣ Check: what happens if there is no nonlinearity, i.e. $z = V(Wx + b) + c$? Is this more powerful than basic linear models? (see the note below)
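One way to answer that check (this step is not spelled out on the slide): without the nonlinearity $g$, the two layers collapse into a single affine map, so nothing is gained over a linear model:

$z = V(Wx + b) + c = (VW)x + (Vb + c) = \tilde{W}x + \tilde{b}$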

Page 17:

Deep Neural Networks

Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

Page 18:

Feedforward Networks, Backpropagation

Page 19:

Logistic Regression with NNs

‣ Single scalar probability: $P(y|x) = \frac{\exp(w^\top f(x, y))}{\sum_{y'} \exp(w^\top f(x, y'))}$

‣ Compute scores for all possible labels at once (returns vector): $P(y|x) = \mathrm{softmax}\left([w^\top f(x, y)]_{y \in \mathcal{Y}}\right)$

‣ softmax: exps and normalizes a given vector: $\mathrm{softmax}(p)_i = \frac{\exp(p_i)}{\sum_{i'} \exp(p_{i'})}$

‣ $P(y|x) = \mathrm{softmax}(W f(x))$: weight vector per class; W is [num classes x num feats]

‣ $P(y|x) = \mathrm{softmax}(W g(V f(x)))$: now one hidden layer
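A minimal sketch of this one-hidden-layer model in NumPy (the sizes and random parameters are illustrative, not from the lecture):

import numpy as np

def softmax(p):
    # exponentiate and normalize a vector (shift by max for numerical stability)
    e = np.exp(p - p.max())
    return e / e.sum()

n, d, num_classes = 5, 3, 2           # toy sizes
f_x = np.random.randn(n)              # feature vector f(x)
V = np.random.randn(d, n)             # hidden-layer weights, d x n
W = np.random.randn(num_classes, d)   # output weights, num_classes x d

z = np.tanh(V @ f_x)                  # hidden layer g(V f(x))
probs = softmax(W @ z)                # P(y|x) = softmax(W g(V f(x)))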

Page 20:

Neural Networks for Classification

$P(y|x) = \mathrm{softmax}(W g(V f(x)))$

[Pipeline diagram:] f(x) (n features) → V (d x n matrix) → g (nonlinearity: tanh, relu, …) → z (d hidden units) → W (num_classes x d matrix) → softmax → P(y|x) (num_classes probs)

Page 21:

Training Neural Networks

‣ Maximize log likelihood of training data

‣ $i^*$: index of the gold label

‣ $e_i$: 1 in the ith row, zero elsewhere. Dot by this = select ith index

$z = g(V f(x))$,   $P(y|x) = \mathrm{softmax}(Wz)$

$\mathcal{L}(x, i^*) = \log P(y = i^* \mid x) = \log\left(\mathrm{softmax}(Wz) \cdot e_{i^*}\right)$

$\mathcal{L}(x, i^*) = Wz \cdot e_{i^*} - \log \sum_j \exp(Wz) \cdot e_j$

Page 22:

Computing Gradients

$\mathcal{L}(x, i^*) = Wz \cdot e_{i^*} - \log \sum_j \exp(Wz) \cdot e_j$

‣ Gradient with respect to W:

$\frac{\partial}{\partial W_{ij}} \mathcal{L}(x, i^*) = \begin{cases} z_j - P(y = i \mid x)\, z_j & \text{if } i = i^* \\ -P(y = i \mid x)\, z_j & \text{otherwise} \end{cases}$

‣ Looks like logistic regression with z as the features!
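A short derivation of that gradient, consistent with the loss above (this step is not written out on the slide): since $(Wz)_k = \sum_j W_{kj} z_j$,

$\frac{\partial}{\partial W_{ij}} \mathcal{L}(x, i^*) = \frac{\partial}{\partial W_{ij}} \left[ (Wz)_{i^*} - \log \sum_k \exp\left((Wz)_k\right) \right] = z_j \mathbb{1}[i = i^*] - \frac{\exp\left((Wz)_i\right)}{\sum_k \exp\left((Wz)_k\right)}\, z_j = z_j \mathbb{1}[i = i^*] - P(y = i \mid x)\, z_j$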

Page 23:

Neural Networks for Classification

[Same pipeline diagram: f(x) → V → g → z → W → softmax → P(y|x), annotated with $\partial\mathcal{L}/\partial W$ at the output weights and $z$ at the hidden layer]

$P(y|x) = \mathrm{softmax}(W g(V f(x)))$

Page 24:

Computing Gradients: Backpropagation

$z = g(V f(x))$  (activations at hidden layer)

$\mathcal{L}(x, i^*) = Wz \cdot e_{i^*} - \log \sum_j \exp(Wz) \cdot e_j$

‣ Gradient with respect to V: apply the chain rule

$\frac{\partial \mathcal{L}(x, i^*)}{\partial V_{ij}} = \frac{\partial \mathcal{L}(x, i^*)}{\partial z} \frac{\partial z}{\partial V_{ij}}$

[some math…]

$\mathrm{err(root)} = e_{i^*} - P(y|x)$  (dim = m)

$\frac{\partial \mathcal{L}(x, i^*)}{\partial z} = \mathrm{err}(z) = W^\top \mathrm{err(root)}$  (dim = d)
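One way to spell out the "[some math…]" step, consistent with the gradient for W two slides back (this derivation is added here, not taken from the slide):

$\frac{\partial \mathcal{L}(x, i^*)}{\partial (Wz)} = e_{i^*} - \mathrm{softmax}(Wz) = \mathrm{err(root)}, \qquad \frac{\partial \mathcal{L}(x, i^*)}{\partial z} = W^\top \mathrm{err(root)} = \mathrm{err}(z)$

$\frac{\partial z_i}{\partial V_{ij}} = g'\left((V f(x))_i\right) f(x)_j \quad\Rightarrow\quad \frac{\partial \mathcal{L}(x, i^*)}{\partial V_{ij}} = \mathrm{err}(z)_i \, g'\left((V f(x))_i\right) f(x)_j$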

Page 25:

Backpropagation: Picture

[Same pipeline diagram: f(x) → V → g → z → W → softmax → P(y|x), annotated with $\partial\mathcal{L}/\partial W$, err(root), and err(z)]

$P(y|x) = \mathrm{softmax}(W g(V f(x)))$

‣ Can forget everything after z, treat it as the output, and keep backpropping

Page 26:

Backpropagation: Takeaways

‣ Gradients of output weights W are easy to compute: looks like logistic regression with hidden layer z as the feature vector

‣ Can compute the derivative of the loss with respect to z to form an "error signal" for backpropagation

‣ Easy to update parameters based on the "error signal" from the next layer; keep pushing the error signal back as backpropagation

‣ Need to remember the values from the forward computation

Page 27:

Applications

Page 28:

NLP with Feedforward Networks

Botha et al. (2017)

‣ Part-of-speech tagging with FFNNs

"Fed raises interest rates in order to…"

f(x) = [emb(raises), emb(interest), emb(rates), other words, feats, etc.]
       (previous word, curr word, next word)

‣ Word embeddings for each word form input

‣ ~1000 features here: smaller feature vector than in sparse models, but every feature fires on every example

‣ Weight matrix learns position-dependent processing of the words (a small sketch follows below)
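A minimal sketch of this setup in PyTorch (the embedding dimension, window of three words, and tag count are illustrative choices, not values from Botha et al.):

import torch
import torch.nn as nn

vocab_size, emb_dim, hid, num_tags = 10000, 50, 100, 45   # illustrative sizes

emb = nn.Embedding(vocab_size, emb_dim)

ffnn = nn.Sequential(
    nn.Linear(3 * emb_dim, hid),   # separate weights for each slot: prev / curr / next
    nn.Tanh(),
    nn.Linear(hid, num_tags),
    nn.Softmax(dim=-1),
)

word_ids = torch.tensor([3, 17, 42])   # indices of previous, current, next word
f_x = emb(word_ids).view(-1)           # concatenate the three embeddings into one feature vector
probs = ffnn(f_x)                      # distribution over POS tags for the current word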

Page 29:

NLP with Feedforward Networks

‣ Hidden layer mixes these different signals and learns feature conjunctions

Botha et al. (2017)

Page 30:

NLP with Feedforward Networks

‣ Multilingual tagging results:

Botha et al. (2017)

‣ Gillick used LSTMs; this is smaller, faster, and better

Page 31:

Sentiment Analysis

‣ Deep Averaging Networks: feedforward neural network on the average of word embeddings from the input (a small sketch follows below)

Iyyer et al. (2015)
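A minimal sketch of the averaging idea (layer sizes and depth are illustrative, not taken from Iyyer et al.):

import torch
import torch.nn as nn

vocab_size, emb_dim, hid, num_classes = 10000, 50, 100, 2   # illustrative sizes

emb = nn.Embedding(vocab_size, emb_dim)
dan = nn.Sequential(
    nn.Linear(emb_dim, hid), nn.Tanh(),
    nn.Linear(hid, num_classes), nn.Softmax(dim=-1),
)

sentence = torch.tensor([4, 8, 15, 16, 23])   # word indices for one input sentence
avg = emb(sentence).mean(dim=0)               # average of the word embeddings
probs = dan(avg)                              # sentiment distribution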

Page 32:

Sentiment Analysis

[Results table comparing bag-of-words models with Tree RNN / CNN / LSTM models]

Wang and Manning (2012); Kim (2014); Iyyer et al. (2015)

Page 33:

Coreference Resolution

‣ Feedforward networks identify coreference arcs

Clark and Manning (2015), Wiseman et al. (2015)

"President Obama signed…"    "He later gave a speech…"
(are these two mentions coreferent?)

Page 34:

Implementation Details

Page 35:

Computation Graphs

‣ Computing gradients is hard!

‣ Automatic differentiation: instrument code to keep track of derivatives

y = x * x     --codegen-->     (y, dy) = (x * x, 2 * x * dx)

‣ Computation is now something we need to reason about symbolically

‣ Use a library like Pytorch or Tensorflow. This class: Pytorch
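What this looks like with PyTorch's autograd (a toy check of the x * x example above):

import torch

x = torch.tensor(3.0, requires_grad=True)
y = x * x        # forward pass records the computation graph
y.backward()     # backward pass computes dy/dx
print(x.grad)    # tensor(6.) = 2 * x, matching the hand-written derivative above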

Page 36:

Computation Graphs in Pytorch

$P(y|x) = \mathrm{softmax}(W g(V f(x)))$

import torch
import torch.nn as nn

class FFNN(nn.Module):
    def __init__(self, inp, hid, out):
        super(FFNN, self).__init__()
        self.V = nn.Linear(inp, hid)
        self.g = nn.Tanh()
        self.W = nn.Linear(hid, out)
        self.softmax = nn.Softmax(dim=0)

    def forward(self, x):
        return self.softmax(self.W(self.g(self.V(x))))

‣ Define forward pass for the network; gradients are computed automatically

Page 37:

Computation Graphs in Pytorch

$P(y|x) = \mathrm{softmax}(W g(V f(x)))$

ffnn = FFNN(inp, hid, out)
optimizer = torch.optim.Adam(ffnn.parameters())   # example: any optimizer over the network's parameters

def make_update(input, gold_label):
    # gold_label is ei*: one-hot vector of the label (e.g., [0, 1, 0])
    ffnn.zero_grad()   # clear gradient variables
    probs = ffnn.forward(input)
    loss = torch.neg(torch.log(probs)).dot(gold_label)
    loss.backward()
    optimizer.step()

Page 38:

Training a Model

Define a computation graph

For each epoch:

    For each batch of data:

        Compute loss on batch

        Autograd to compute gradients and take step

Decode test set
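A minimal sketch of this recipe in PyTorch, reusing the FFNN module and update step from the previous slides (num_feats, hid, num_classes, num_epochs, training_batches, and test_inputs are placeholder names, not from the lecture):

ffnn = FFNN(num_feats, hid, num_classes)             # define the computation graph
optimizer = torch.optim.Adam(ffnn.parameters())

for epoch in range(num_epochs):
    for input, gold_label in training_batches:       # each batch of data
        ffnn.zero_grad()
        probs = ffnn.forward(input)
        loss = torch.neg(torch.log(probs)).dot(gold_label)   # loss on batch
        loss.backward()                               # autograd computes gradients
        optimizer.step()                              # take a step

predictions = [torch.argmax(ffnn.forward(x)) for x in test_inputs]   # decode test set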

Page 39:

Batching

‣ Batching data gives speedups due to more efficient matrix operations

‣ Need to make the computation graph process a batch at the same time

def make_update(input, gold_label):
    # input is [batch_size, num_feats]
    # gold_label is [batch_size, num_classes]
    ...
    probs = ffnn.forward(input)   # [batch_size, num_classes]
    loss = torch.sum(torch.neg(torch.log(probs)) * gold_label)   # elementwise product with one-hot labels, summed over the batch
    ...

‣ Batch sizes from 1-100 often work well

Page 40:

Next Time

‣ More implementation details: practical training techniques

‣ Word representations / word vectors

‣ word2vec, GloVe